Project Proposal

Creative Commons License
Successful proposal for NeCTAR Virtual Lab funding: Above and Beyond Speech, Language and Music: A Virtual Lab for Human Communication Science (HCSvLab) by The University of Western Sydney is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.

[This is the successful proposal document for the Virtual
Lab for Human Communication Science (HCS vLab), with some financial and personnel
details redacted. We're putting it here as an introduction to the project and
to kick off the project blog – Peter Sefton 2012-12-13]

D2 Response

Section A. Header Details

A.1 Program and Proposal Title

Program: Virtual Laboratory

Title: Above and Beyond Speech, Language and
Music: A Virtual Lab for Human Communication Science (HCS vLab)

A.2 Proposer

Organisation Name

University of Western Sydney

Contact Name

Professor Denis Burnham


Director, Marcs Institute

Business Address

University of Western Sydney

Postal Address

Locked Bag 1797, Penrith South DC NSW 2751


(02) 9772 6681


(02) 9772 6040

Mobile Phone


A.2.2 Participating Organisations

Organisation / Group Name – Role

  • UWS (lead organisation) – Project management; project governance and
    reporting; business, user and system requirements and user-testing
  • eResearch Unit, UWS – System administration; communications
  • ITS, UWS – Application maintenance and further development after the
    project proper has completed
  • Library, UWS – Research Data Catalogue
  • Intersect – Software development; project management; hosting and support
  • RMIT – TBC
  • Macquarie University, Australian National University, University of Western
    Australia, University of New England, AusNC Inc. and other partner
    universities – TBC
  • University of New South Wales, La Trobe University and other partner
    universities – TBC

A.2.3 Project Funding Summary

EIF Funds Requested: [Redacted for public version]

Total Co-investment Offered: [Redacted for public version]

[Redacted for public version]

    Section B. Proposal Summary

B.1      Executive Summary

Australia plays a strong and prominent international role in Human
Communication Science, encompassing research in speech science, speech
technology, computer science, language technology, behavioural science,
linguistics, music science, phonetics, phonology, and sonics and acoustics.
However, this position is in jeopardy.

    Research Effort and Expertise:
    In Australia we have excellent
    researchers analysing their own
    corpora of data using their own analysis
    tools in relative isolation. Yes,
these researchers meet and share their knowledge at national and international
conferences, but (a) relatively infrequently and (b) in discipline-centred

    and Inefficiency:
    Research conducted in isolation entails local
    unshared mark-up or augmentation of local data sets, and inefficient repetition
    of search, information retrieval, annotation, and analysis using tools that are
    usually home-grown, inaccessible (e.g., idiosyncratic command line execution)
    and unsupported.

In order to keep abreast of the modern pace
of research, to leverage the available yet unrealised interdisciplinary
challenges and opportunities, and to go beyond the isolated
desk-PC-lab-university-bound model of research, a quantum leap in the way in
which we do our Human Communication Science is absolutely essential. We must
move into a research environment that will eradicate the waste involved in
repeated unshared analyses; ignite the research spark that affords the
serendipity of new tool-corpus combinations; and dramatically improve
scientific replicability by moving local and idiosyncratic desktop-based tools
and data to an easy-access, in-the-cloud, public, replicable environment that
standardises, defines, and captures procedures and data output. We must
connect HCS researchers, and their desks, computers, labs, and universities in
order to build upon the achievements so far, produce emergent research
knowledge, and instil this approach in our new interdisciplinary PhD students.

Fortunately, but not without a good deal of
forward planning, the common ground for
such a Virtual Lab has been carefully prepared. The 2004-2009
    ARC-funded UWS-administered Human Communication Science Network (HCSNet) identified
    a community of over 1000 Australian Human Communication Science (HCS)
    researchers, engaged that community in over 60 different workshops, seminars
    and conferences, integrated that community with like international communities,
    and augmented
    the success rate of Australian HCS grant applications, including
    significant multi-site, interdisciplinary projects such as the ‘Thinking Head’
    and the ‘Big Australian Speech Corpus’. The focus that binds this
    community is the manner in which humans communicate with each other and with
    computers and machines via codified means — speech and text, music and

    The HCSNet experience made
    it clear that there is a thriving HCSNet community with vast potential for cross-disciplinary
    research connectivity (i) to provide new insights into old problems by
    approaching them from different disciplinary perspectives or with a hitherto
    untried method, and (ii) to apply novel combinations of old ideas or methods of
    analysis from different disciplines to new problems (see for instance a
    compendium of 30 HCSNet research papers, Dale, Burnham & Stevens, 2011). On
    the other hand, the HCSNet experience has also made it abundantly clear that
    one of the main impediments to the quantum leap that is required for HCS
    research to bloom is the difficulty for a researcher from one discipline to
    apply the tools and techniques of another discipline, or to explore data
    collected under one paradigm via a completely different analytical perspective.

HCSNet created the intellectual space
between universities, disciplines, paradigms
and methods for HCS research to flourish. Above
    and Beyond HCSNet, the ‘HCS vLab’ will create the virtual space
    infrastructure that enables easy access to shared tools and data, and overcomes
    the resource limitations of individual desktops. It will be a virtual
    laboratory that makes it possible for HCS researchers from a diverse range of
    disciplines to access an amalgamation of existing data sets (corpora)
    and selected analytical tools collected and generated from their midst and
    then, in this project, put into the cloud. Most
    importantly, it will enable the guided use of workflow tools and options
    to allow researchers to cross disciplinary boundaries.

    Consider the Music Researcher who wishes to
    analyse auditory-visual cues that Indonesian singers provide to elicit
    particular emotions in their audience. Our researcher has in mind some
    instances of particular songs, but then where to? How can auditory-visual
    records of appropriate songs be found, the words transcribed, and the phonetic
    nuances annotated and aligned with the auditory-visual record and the emotion
    in the face at any particular time? In HCS vLab a consolidated cross-corpus search
would produce candidate songs (probably mainly from the PARADISEC corpus in this case), the text from a
transcription tool such as Transcriber analysed
with the ParGram grammar for
Indonesian, the emotion information from the DeMoLib tool, and EMU affording
a prosodic analysis (usually reserved for speech) of the songs. Without HCS
    vLab our researcher would have had to access PARADISEC, run a search, look around for the tools to analyse it,
    and probably be unaware of the range of tools that would be of use. With HCS
    vLab our researcher has searched a complex of corpora, and been guided through
    a complex of analytical tools and provided with a rich multimodal description
    of relevant songs. Moreover, the resultant metadata can be stored, as can the
    sequence of operations, and later be applied to other songs; the HCS vLab will
    house an ever-expanding enriched data set that will endure, not only for this
    researcher but for any of the national and international HCS researchers to use
    and augment. This example demonstrates various strengths of the vLab –
    multimodal, cross-cultural data and analyses, affective and lexical material,
    and focus on unique features of the Asia-Pacific region. The same process can
    be applied to any of the data available through the HCS vLab, with similar or
    quite different emphases. While this example is somewhat esoteric, more
    mainstream applications include research supporting automatic speech
    recognition (taxi ordering, directory assistance etc.), hearing aids and
    cochlear implants, interactive learning programs for children with learning
    disabilities, automatic melody recognition, forensic determination of origin
    and background of particular accents, computer-based information retrieval
    based on music, speech, sounds, or visual patterns, and psycholinguistic
    studies of second language learning and pedagogy.

    The HCS vLab framework will be built by
    Intersect, a leading eResearch organisation in NSW with a proven track record
    in building research software infrastructure. Intersect will be responsible for
    the technical design of the vLab and will manage the software development
    lifecycle. The Intersect Project Manager will report to the HCS vLab Project
    Manager; the project will be governed by the HCS vLab Steering Committee
    consisting of a blend of senior stakeholders from partner organisations along
    with a number of technical advisors. The vLab will be delivered in 3
    development releases over 14 months, and will incorporate 11 tools and 7

    HCS vLab will be built upon the ground prepared by HCSNet and the raw
    cross-disciplinary potential that HCSNet seeded. Corpora have been collected
    and tools have been developed by a range of technically-skilled HCS
    researchers, and each begs for a common ground, but due to the local needs of
    individual researchers and the impossibility of one lab or even one institution
providing the required person- and computational-power, these corpora and tools
are isolated on PCs and servers around the country. HCS vLab will:

• integrate 7 existing corpora,
  e.g., the 100TB ClueWeb corpus; the 3000-hour ARC-LIEF-funded AusTalk
    audio-visual speech corpus, the ANDS-funded Australian National Corpus (text
    & speech), the PARADISEC corpus of speech, text & song in indigenous
    & endangered languages, and the AMC corpus of Australian music.

  • integrate 11 existing tools, e.g., Natural Language
    Tool Kit (NLTK) for text analytics; EMU for speech analysis and interactive
    waveform labelling; PsySound3 with physical and psychoacoustic algorithms for
    sound & music analysis; Johnson-Charniak parsers to generate parse trees;
    & DeMoLib, for lip-tracking video analysis.

    These wide-ranging
    integrations are only possible because of the cohesion of a >1200-strong HCS
    research community who eagerly await the development of a vLab that will allow
    full realisation of the collaborative power of HCS research; a community that
    includes: Bruce Croft & Mark Sanderson (RMIT), top Information Retrieval
    (IR) researchers in the world; Denis Burnham & Catherine Stevens (UWS,
    speech & music experts, respectively) leaders on large ARC projects –
HCSNet, the Thinking Head, and for Burnham, the AusTalk corpus; Janet Fletcher,
Andy Butcher & Marija Tabain (UniMelb, Flinders, LaTrobe), world experts in
    Australian indigenous languages; Steve Cassidy, Mark Johnson, Robert Dale
    (Macq), and Steven Bird (UniMelb) who are at the hub of language technology
    research in Australia; and Michael Wagner (UC), Roberto Togneri, Mohammed
    Bennamoun (UWA) & young stars Trent Lewis (Flinders) & Roland Göcke
    (UC), who are pioneering AV face and voice analysis.
    Over and above the realisation of a heuristic HCS research environment, HCS vLab will be:

    • accessible:  the variety of HCS tools will be
      accessible to non-technical researchers via workflow tools, stored protocols,
      and interactive GUIs, while retaining capacity for more sophisticated analyses.

    • interoperable:
       the generic HCS vLab infrastructural
      support will allow incorporation of HCS corpora from various platforms (e.g.,
      Windows, Mac OS, Linux), and interoperability with other major systems
at home, e.g., the already-funded
NeCTAR Humanities Networked Infrastructure (HuNI) vLab, and
      internationally by virtue of our Product Owner’s intimate domain knowledge and
      wide collaborations.

• sustainable:  13 universities, 3 organisations, and 47
  key investigators have provided $423K in cash and $1.9M in-kind (1.7 times
  the request to NeCTAR), with 5 Gold, 5 Silver, 4 Bronze
  & 2 Other members (see our Partner Model, Appendix E), with partner cash
  supporting sustained operational development and development of capabilities
  and reach including future plug-in of additional tools and corpora, and
  specialist user support well beyond the formal conclusion of HCS vLab
  construction.

    Without the HCS vLab the promise of HCSNet and the
    careful priming of the HCS community
    will have been in vain. The HCS vLab
    is the natural progression for a strong receptive and ambitious research
    community that is on the move and wants to keep moving. It will create the
    avenue by which HCS research can make a quantitative and qualitative leap to
    enhanced capability, collaboration and output, to travel well beyond the
    geographical confines of individual labs, and well above the disciplinary
    confines of speech, language, music, or sonics alone into an interdisciplinary
    and heuristic Human Communication Science cloud-space. 

    B.1.1      Summary
    for Public Release

    Applications in automatic speech recognition (taxi ordering, directory
    assistance etc.), hearing aids and cochlear implants, interactive learning
    programs for children with learning disabilities, automatic melody recognition,
    forensic determination of origin and background of particular accents,
    computer-based information retrieval based on music, speech, sounds, or visual
    patterns, psycholinguistic studies of second language learning and pedagogy all
    depend upon research in Human Communication Science (HCS). Human Communication Science encompasses
    the areas of speech science, speech technology, computer science, language
    technology, behavioural science, linguistics, music science, phonetics,
    phonology, and sonics and acoustics. In turn HCS research depends upon datasets
    (corpora) of speech, music, text, faces, sounds, and specialised tools by which
    to search, analyse and annotate these data. Australia boasts a strong and
    active community of HCS researchers who have developed a wealth of corpora and
    tools relevant to HCS research. However, these researchers tend to analyse their
    corpora of data using their own analysis
    tools in relative isolation. Yes,
these researchers meet and share their knowledge at national and international
conferences, but (a) relatively infrequently and (b) in discipline-centred

    While HCS research in Australia is
    blooming, especially due to the highly successful Australian Research Council
    funded HCS Network from 2005-2009, and related research projects, research conducted
    in isolation entails inefficient repetition of analysis of local data sets. HCS
    research in Australia, and successful further real-life applications, requires going
    beyond the isolated desk-PC-lab-university-bound model of research into a new research
    environment. Such an environment will eradicate the waste involved in repeated
    unshared analyses; ignite the research spark that affords the serendipity of
    new tool-corpus combinations; and dramatically improve scientific replicability
    by moving corpora and tools and the analyses conducted with these into an easy
    access, shared, in-the-cloud, public, replicable environment.

    The HCS virtual Laboratory (HCS vLab) will
    connect HCS researchers, their desks, computers, labs, and universities and so
    accelerate HCS research and produce emergent knowledge that comes from novel
    application of previously unshared tools to analyse previously difficult to
    access data sets. The HCS vLab infrastructure will overcome resource
    limitations of individual desktops; allow easy access to shared tools and data;
    and provide the guided use of workflow tools and options to allow researchers
    to cross disciplinary boundaries.

    The HCS vLab will be: 

    • accessible to non-technical
      researchers via workflow tools, stored protocols, and interactive GUIs, while
      retaining capacity for more sophisticated analyses; 

• interoperable
  by incorporating HCS corpora from various platforms (e.g., Windows, Mac
  OS, Linux), and ensuring compatibility with other major systems in
  Australia and internationally by virtue of our Product Owner's intimate domain
      knowledge and wide collaborations; and 

    • sustainable:  13 universities, 3 organisations, and 47
      key investigators have provided $423K in cash and $1.9M in-kind to support
      sustained operational development and development of capabilities and reach
      including future plug-in of additional tools and corpora, and specialist user
      support well beyond the formal conclusion of HCS vLab construction.

    The HCS vLab will create the avenue by which HCS
    research can make a quantitative and qualitative leap to enhanced capability,
    collaboration and output, to travel well beyond the geographical confines of
    individual labs, and well above the disciplinary confines of speech, language,
    music, or sonics alone into an interdisciplinary and heuristic Human
    Communication Science cloud-space. 

B.2      Research Community Profile

    HCS vLab will be built by and
    for the Human Communication Science research community. The Human Communication
    Science community was formally established with the creation of HCSNet, and HCS
    researchers are active in various professional organisations as well as in
    Universities and Research Organisations, as set out below.

    1. The Network for Human Communication Science (HCSNet)

    The HCSNet community is a
    broad-reaching interdisciplinary mix of researchers from right across
    Australia, whose research spans speech, language and music and sonics. The
community banded together and obtained funds from the Australian Research Council (ARC) for an ARC Research
Network (RN0460284, 2004-2009, $2M), a network that greatly
facilitated research and research collaboration in the >1200-strong community of
HCS researchers. HCSNet aimed to build Australia's reputation as a
    leader in communication science and technology via advances in its priority
    areas of ‘Speech’, ‘Effective Human-Computer Interfaces’, ‘Next Generation
    Search Technology’, ‘Human Communication Disorders’, and ‘Human and Machine
    Perception and Action’. This succeeded: of the 47 researchers on this Virtual
    Lab bid, all but the 5 researchers who were not in Australia at the time were
    HCSNet members; and as a result of collaborations formed and projects hatched
    in the > 60 HCSNet workshops and conferences during the life of HCSNet, the HCSNet community spawned and
    continues to incubate various multidisciplinary multi-institutional projects,
    such as the following:

The Thinking Head (ARC/NH&MRC Special Initiatives, TS0669874,
2006-2012) – Research on auditory-visual
    speech, dialog, speech and speaker recognition, human-machine interaction,
    avatar and robot development. 11 Chief Investigators (CIs) from 6 Australian
    universities; 4 Partner Investigators (PIs) from 3 international institutions.

    Forensic Voice Comparison (ARC Linkage, LP100200142, 2010-2013) –
    Making demonstrably valid and reliable forensic voice comparison a practical
    everyday reality in Australia. 3 UNSW CIs, partners in Spain and China,
    industry partners including Australian Federal Police (AFP), National Institute
    of Forensic Science Australia (NIFS), NSW and Queensland Police and others.

The Big ASC (Australian Speech Corpus,
ARC LIEF, 2010-2012, LE100100211) – Large (1000 speakers x
    3 hours speech) auditory-visual speech corpus from 17 sites across Australia.
    29 CIs, 11 Australian
    universities in every state of Australia, 1 international PI.

    DADA-HCS (ARC SRI e-Research Support,
    2005-2006, SR0567319) –
    Distributed Access & Data Annotation
    for Human Communication Sciences. 9 CIs, 5
    Australian universities and institutions.

    Other funded projects involving
    HCS researchers preceded HCSNet and set the scene for the growing zeitgeist in
    human communication science, e.g., See Hear! The Multimodal Recording and
    Analysis Facility
    – new interfaces for analysis of complex visual
    and auditory scenes, and creation of a research tool for sound and music
analysis (ARC LIEF: LE0668448, 2006, 12 CIs from 4 Australian universities).

    Such projects have cemented
    HCS cross-disciplinary links by focussing research effort and harnessing
    cross-disciplinary research expertise. The result is a mature Human Communication
    Science research community au fait with the approaches and strengths of
    other disciplines, but a community yearning for a vehicle to transport these
    approaches to new lands. 

    2. The Australasian Speech Science & Technology
    Association (ASSTA)

    ASSTA (est. 1988) is the peak
    speech science and technology body in Australasia, and the meeting ground for
    engineers, computer scientists, cognitive scientists, psycholinguists, language
    technologists, phoneticians, linguists, forensic speech scientists, and speech
    pathologists via instruments such as its biennial Australasian Speech Science
    & Technology conference and its international counterpart, Interspeech,
    which ASSTA hosted in 2008. ASSTA has financially backed HCS endeavours
    (HCSNet; Forensic Voice Comparison; the Big ASC above), and will financially
    and academically support the current proposal. Along with other professional
    bodies (see B.8), ASSTA supports HCS research in Australia via research and
    conference funding (especially for early career researchers).

3. HCS Community members in Universities and Research Organisations

    University of Western Sydney (UWS): UWS, and in particular the
    Marcs Institute, has played a leading role in the establishment of the HCS community; it was the lead institution on
    the HCSNet, Thinking Head, See Hear!, and Big ASC projects (see above), and
    provided and continues to provide cash and infrastructure support for these and
    like projects. UWS is committed to eResearch, and is a strong supporter of
    interdisciplinarity and eResearch initiatives. UWS will provide $175K cash and
    $528K in-kind and ongoing support for the project, support that can only
    facilitate HCS vLab uptake among HCS researchers. UWS and Marcs Institute will
    continue to play a leading role in maintaining the HCS community both by its
    contribution to research and its lead role in this project and promoting the
    use of HCS vLab. Marcs Institute, elevated
    from a UWS Centre to one of its 4 Institutes in 2011, is led by the
    Project Lead on this bid, Prof Denis Burnham. Marcs comprises 51 researchers and 24 higher degree students, who
    conduct behavioural, neuroscience, and computational research on human-human
    and human-machine communication in normal, heightened and degraded contexts in
    5 programs: Speech & Language, Music Cognition & Action, Bioelectronics
    & Neuroscience, Multisensory Processing, and Human-Machine Interaction; and
    in ARC FOR codes areas 1701, 1702, 2004, 1904, 0801, 0906, and 0903. Marcs has
    current public funding of $6,664,807, comprising 6 ARC Discovery grants, 1
    Discovery Early Career Research Award grant, 1 ARC/NHMRC Special Research
    Initiative, and 1 ARC Linkage Infrastructure, Equipment and Facilities grant
    across a range of areas that will use and promote the HCS vLab in areas such as
    auditory-visual  speech and
    cognitive processing; speech perception, regional accents, reading and language
    acquisition; reverse engineering of the brain; acoustic factors in music
    perception; human-machine interaction; and corpus studies. Moreover, Marcs has established collaborations with
    researchers from psychology, linguistics, music, education, computer science,
    engineering, and various industry partners, in over 20 major research
    institutes and over 30 additional individuals in Australasia, North America,
the UK, and Europe, which will further add to the reach of HCS vLab to the HCS
community.

    Further Universities and Research Organisations: Macquarie
    University and the lead institution, UWS,
    have a long history of collaboration and project co-leadership. The convenor of
    HCSNet, Prof Robert Dale, directs the Macquarie Centre for
    Language Technology and was a senior investigator in the Thinking Head, Big ASC
    and DADA projects; A/Prof Felicity Cox and HCS vLab Product Owner, Associate
Prof Steve Cassidy are major players in the Big ASC project. Together these
make Macquarie a major node of HCS research. Other universities in the project all have a
    history of involvement in HCS research through HCSNet, one or more of the above
    multi-disciplinary projects, and in their own HCS projects. These HCS community
    universities contribute to HCS research in the specialist areas as set out
    hereafter. ANU: Phonetics,
    Indigenous Languages; Canberra:
    Speech Forensics, AV Speech and Speaker Recognition; Flinders: Computer Science and AV speech; Melbourne: Engineering, Phonetics; Sydney: Indigenous Song, Music, Speech Pathology; Tasmania: Cognitive Science,
    Psychology; UNSW:
    Speech Science, Music, Emotion; UWA: Speech and Speaker
    Recognition; RMIT: Information
    Retrieval (IR) for very large databases, dialog and multi-agent systems,
    in AI and computer science; UNE:
    Psycholinguistics, regional languages, logic in child language; LaTrobe: Language diversity, minority
    languages, Australian indigenous languages, data-oriented and theory-oriented
    approaches; NICTA: Machine
    Learning and NLP.

    The wide geographical and disciplinary spread of the 14 partner research
    organisations across Australia, along with the active research profiles of the
    47 individual researchers in the bid, with their ongoing
    Higher Degree Research (HDR) student load and numerous significant national and international
collaborations, provide a strong scaffold of support and uptake for this project.

    B.3      Development
    Organisation Profile

    1. Intersect Australia

Intersect Australia Ltd is a not-for-profit company limited by guarantee, owned and
    funded by its members, the universities in NSW, state government departments,
    and other organisations undertaking research in NSW.

Intersect has a strategic focus on national research infrastructure. Intersect is a
    member of The National Computational Infrastructure (NCI), and the Australian
    Access Federation (AAF). Intersect has undertaken and is undertaking many
    projects deploying data capture and management solutions for the Australian
    National Data Service (ANDS). The software Intersect develops integrates with
    infrastructure provided through these bodies.

Since its establishment, Intersect has demonstrated that it is one of Australia's
    leading eResearch organizations in having the capability and capacity for
    undertaking eResearch projects.

    Capacity and Capabilities
• Intersect has approximately 50 staff. It has
  established a capacity and capability to develop, deploy and support
  substantial and complex eResearch infrastructure that is unique in Australia.
  This capability is built on a company culture which emphasises a focus on the
  client and on engineering excellence. Intersect has built a team that delivers
  eResearch solutions on time, on budget and of value.

    • Intersect’s Engineering Division brings together
      many years of commercial experience in developing large scale IT systems across
      many sectors such as academia, government, banking and enterprise security
      tools. The team of 30 staff includes user interface designers, specialist test
      engineers, software and systems engineers and project managers.

    • Intersect’s Services Division staff have
backgrounds in publicly funded research, commercial research and development,
      and commercial information technology service provision. The team of eleven
      staff is responsible for outreach and engagement before, during and after
      development commences. They provide capability to carry out stakeholder
      management, requirements gathering, and product ownership.

    • The Operations Division, comprising five staff,
      has been centrally involved in systems integration projects and the
transitioning of projects from development to commissioning and ongoing operation.

    Track record and relevant experience

    • Intersect has undertaken and successfully
      delivered many eResearch projects (at the time of writing approximately 25),
      including projects with development and integration budgets in excess of $1
      million. These projects provide solutions and infrastructure to research
      efforts across a range of disciplines. These projects include:

    • analysis and integration projects for many NCRIS
      capability areas (e.g. AMMRF, PHRN, AAL) analysis and development projects for
      non-university research bodies (ANSTO, NSW Office of Environment and Heritage)

    • software development projects funded through PfC
      capabilities (e.g. 11 ANDS data capture projects for 4 universities) software
      development projects funded directly by our membership (e.g. Rainfall)
      strategic software development projects funded by Intersect (e.g. Genomic Data
      Repository, Australian Schizophrenia Research Bank)

    • In the vocabulary of NeCTAR, a number of these
      projects would fit within the parameters of eResearch tools (e.g. ANDS data
capture projects, ASRB) or virtual laboratories (e.g. PHRN). Intersect is
      currently delivering six ANDS-funded projects for its members.

Please see the attached letters of support for Intersect's track record as a
    development partner.

    Approach to quality standards

Intersect has not sought formal certification under any standards (e.g. ISO 9000).
    Intersect follows a three part method for achieving quality:

    • “Say what you are going to do”. Starting from
      the concept stage of this project, and continuing throughout the project, Intersect
      keeps the customer informed of what they are doing and how they are doing it.
      Intersect has processes covering Consultation, Business Analysis, Project
      Management and Software Engineering. Intersect works with clients to tailor
      these to suit their needs and the project’s needs.

    • “Do what you said you were going to do”. Intersect
      follows their processes. If issues are encountered then Intersect talks to the
      customer to find agreeable solutions.

    • “Prove it”. Intersect keeps the smallest quality
      record possible, as documented in a quality management plan.

This plan is written during the elaboration stage of development (see Item B.15), in
    conjunction with the stakeholders and NeCTAR.

    Support and warranty mechanisms

Intersect issues a formal 3-month warranty for all projects; the warranty commences after
    user-acceptance testing has completed. During this period all defects are fixed
    at no cost to the customer. Intersect’s on-going defect rate is less than one
    new defect discovered per month, across 25 deployed systems. In practice, the
    defect rate has been low enough that Intersect has fixed the majority of
    defects that come to light after the warranty period has expired. Additional
    feature requests are carried out on a fee-for-service basis.


    • RMIT. The Information
      Retrieval (IR) group in the School of Computer Science & IT at RMIT
      University is recognised as an international leader in the development of
      search engines: producing 28 A or A* journal/conference papers in the last ERA
      period. The ISAR group has extensive experience building open source search
      engines, creating the Zettair and MG systems, both widely used. In a separate
      strand of research, the group also evaluates search engine effectiveness. Based
      on citations to its last five years of publication outputs, the RMIT IR group
  was placed in the world's top 15 IR research groups by Microsoft's Academic
  Search, and No. 1 in Australia. Three
      members of the RMIT group will contribute.

• NICTA. NICTA is a
  centre of excellence for ICT research, with over 600 researchers and
      students, including many world leaders in their fields. NICTA’s mission is to
      deliver outstanding ICT research outcomes, and to create wealth for Australia
      through the application of that research. NICTA performs research in a number
      of areas, including Machine Learning and Control and Signal Processing, which
      include outstanding researchers in Language Technology and Text Retrieval.
      NICTA’s Engineering and Technology Development team is expert at transitioning
      research into innovative technology solutions to real problems.

    • RMIT/NICTA Software Engineering. An important
      part of the HCS vLab’s toolkit will be the ability to assemble “processing
      pipelines” involving multiple tools processing the same data sequentially.
      Based on the combined expertise of the NICTA and the RMIT staff involved in
      this project, a Research Engineer will be employed via sub-contract to help the
      Intersect engineering team build a flexible component architecture for the HCS vLab
      compatible with UIMA, an emerging standard for wrapping components for
  processing language, speech, video, and other unstructured data (a brief
  illustrative sketch of this pipeline idea follows this list).
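To make the "processing pipeline" idea concrete, below is a minimal, purely illustrative Python sketch of components that expose a common interface and pass a shared analysis structure along a sequential chain, in the spirit of UIMA's wrapped analysis engines. It is not UIMA code, and all class and function names are invented for illustration rather than taken from the proposal or from UIMA.

```python
# Illustrative sketch only: a toy pipeline of "wrapped" components sharing one
# analysis structure, loosely analogous to UIMA-style analysis engines.

from typing import Dict, List


class Annotator:
    """Common interface every wrapped tool exposes to the pipeline."""

    def process(self, cas: Dict) -> Dict:
        raise NotImplementedError


class Tokenizer(Annotator):
    """Stands in for a text-processing tool such as a tokenizer."""

    def process(self, cas: Dict) -> Dict:
        cas["tokens"] = cas["text"].split()
        return cas


class CapitalisationTagger(Annotator):
    """Stands in for a downstream tool that consumes the tokenizer's output."""

    def process(self, cas: Dict) -> Dict:
        cas["tags"] = ["Cap" if t[:1].isupper() else "low" for t in cas["tokens"]]
        return cas


def run_pipeline(components: List[Annotator], cas: Dict) -> Dict:
    # Components run sequentially over the same shared analysis structure,
    # analogous to a UIMA CAS being passed from one analysis engine to the next.
    for component in components:
        cas = component.process(cas)
    return cas


if __name__ == "__main__":
    result = run_pipeline(
        [Tokenizer(), CapitalisationTagger()],
        {"text": "Speech language and Music data"},
    )
    print(result["tokens"])
    print(result["tags"])
```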

    Operational Organisation Profile

Intersect Australia Ltd is also the Operational Organisation.

    Capacity and capabilities

Intersect provides hosting, operations, outreach, L&D and support for our members'
    and affiliates’ eResearch needs. Intersect hosts its production infrastructure
    with commercial hosting partners ‘ac3’. The hosting is located at the Global
    Switch data centre in Ultimo. Intersect hosting partners provide managed
    services, including backup, system monitoring and logging, and core network
    connectivity for all services. The Intersect systems administration team
    performs network management, user-level support and troubleshooting, and
    general systems administration. Intersect has an access agreement with AARNet;
    all systems hosted by Intersect are “on-net”. Issues with services hosted at
    ac3 are notified to the head of Intersect’s systems administration team.

    Operations, advocacy, L&D and support are built on a
    full-time team of ten eResearch Analysts, an HPC specialist, a data management
specialist and five systems administrators. The team is currently responsible for:

    • Operations of and merit-based allocation to HPC
      facilities, both through Intersect’s own McLaren service, hosted at ac3, and
      Intersect’s partner share of NCI.

    • Hosting data and applications on behalf of Intersect
      members’ researchers. We host off-the-shelf applications, customized
      open-source applications, bespoke software developed by Intersect, data
      accessible via the DataFabric, and data management systems. Hosted systems
      include Confluence, Jira, OpenClinica and OpenCDMS.

    • Hosting hardware on behalf of Intersect clients,
      for example the SAX Institute.

    • Providing first-tier support for HPC users, as
      well as users of national services such as the ARCS Data-Fabric, ARCS Grid
      Computing Service, and AAF’s authentication services.

The team of Analysts has backgrounds in publicly funded research, commercial research
and development, and commercial information technology. They provide both
    one-off and ongoing assistance to research groups, and combine experience
    across a large number of research disciplines.

The systems administration team brings together commercial- and research-based
experience supporting and integrating administrative as well as research systems.

Intersect has applied to the RDSI Node program and has a plan to build and operate an
RDSI node. Intersect has applied to the NeCTAR Research Cloud program, and has
a plan to build and operate a research cloud node. Both services will operate
alongside our existing infrastructure, hosted in commercial hosting facilities.

    Track record and relevant experience

    • Intersect provides operational support for HPC,
      valued at approximately $1m annually, to its membership. Over the last three
      years, Intersect has provided support directly to more than 100 research
      projects comprising hundreds of researchers. In the last year Intersect
      provided specialist ongoing support, in the form of scripting, troubleshooting,
      compilation of software, and design of experiments, to approximately 30
      research projects.

    • Intersect hosts applications or application
      spaces for approximately 30 groups across seven institutions. Hosting is
      provided through a combination of in-house infrastructure and in partnership
      with ac3.

    • Systems administrators at Intersect have been
      involved in the roll-out and support of national services, through most of the
      PfC programs, including ARCS, the AAF and ANDS.

    • Intersect’s eResearch Analysts provide ongoing
      support comprising engagement and outreach, issue management, and learning and
      development to hundreds of researchers each year.

Please see the attached letters of support for Intersect's track record as an
    operating partner.

    Approach to quality standards

Intersect has systems and procedures in place that provide quality control and assurance
    to their customers. The tools used in these systems and procedures include:
    ‘Jira’ to raise and track external and internal issues; ‘Nagios’ to automate
    service monitoring and raise alerts; ‘Cacti’ for usage trend analysis and
    reporting; ‘Splunk’ for log management, troubleshooting and forensics; and
    ‘Confluence’ to document and manage procedures and processes.

    There is an end-to-end process for raising and resolving
support issues, including a process by which support issues are prioritized and escalated.

    Support and warranty mechanisms

Intersect issues a formal 3-month warranty for all projects; the warranty commences after
    user-acceptance testing has completed. During this period all defects are fixed
    at no cost to the customer. In practice, the defect rate has been low enough
    that Intersect has fixed the majority of defects that come to light after the
    warranty period has expired. Additional feature requests are carried out on a
    fee-for-service basis.

    The systems administration team provides user support for
    hosted services. The systems administration team works with the customer to
    configure the initial setup of their service. This work includes making a
    decision on the appropriate hosting arrangements (e.g. local, Intersect,
    commercial) based on the required level of service.

After commissioning, issues are tracked using Jira, and issues are triaged into
    support and defect cases. Defects are escalated to the engineering team.
    Responsibility for each support issue is assigned to a case manager, who looks
    after the reporter until the issue is either resolved or escalated.

    All services are monitored automatically using Nagios.
    Outages and anomalies are reported to the systems administration team, and
    emails are sent to the owners of the service using mailman.

    Scheduled outages are negotiated with the customer as they
    are required e.g. for upgrade of hardware or maintenance releases of software.

    Services are promoted through Intersect’s website,
    newsletter, marketing collateral and outreach program. Outreach activities
    include promotion of services, including through an Intersect-maintained tools
    register (soon to be integrated with CAUDIT’s eResearch portal), as well as
    providing one-on-one assistance to researchers deciding on and using services
    in their research.

Training is supported through Intersect's learning and development
    (L&D) program. Existing L&D material has been developed for services
    including: interactive and self-paced training courses (covering e.g. HPC,
    Google’s REFINE); written material available through our website (e.g. guides
    to Evo, Collaborative Authoring); and an emerging program of web-casts
    demonstrating concepts and usage of tools (e.g. Subversion).

    Other Participants

In addition to UWS (project leadership and management;
access to the AusTalk (BigASC) corpus (Burnham) and the AMC database (Dean, in
association with John Davis, CEO of the Australian
Music Centre); and the ParseEval (Shaw)
tool), the following institutions or groups will be involved. Details of
involvement and contribution are given in Part D3.

    • Intersect
      – Project development, business analysis, and co-investment.

    • Macquarie U – Specialists in Phonetics and in Language
      Technology. Access to the EMU, AusNC (Cassidy), and the Johnson-Charniak parser
      (Johnson) tools; adaptation of the audio aspects of AusTalk (Cox).

    • U of
      Melbourne – Specialists in Phonetics,
      Linguistics and Engineering. Adaptation and testing of the NLTK (Bird),
      PARADISEC tools and the speech component of the PARADISEC corpus (Thieberger).

    • Sydney
      U – Specialists in Music and Speech Pathology. Adaptation
      of the PARADISEC (Music aspects) corpus (Barwick); user testing and feedback
  (Arciuli), adaptation and testing of PsySound3 (Cabrera).

    • ANU
      – Specialists in Phonetics and Indigenous Languages. Adaptation of
      Indigenous languages and text aspects of PARADISEC corpus (Simpson), adaptation
      of the Indonesian corpus (Arka, Mistica), user testing and feedback (Ishihara).

    • Flinders
      U – Specialists in Computer Science and Auditory-Visual (AV) Speech. User
      testing and feedback (Powers, Lewis); adaptation of AV aspects of AusTalk and
      advice on AV aspects of the project (Lewis).

    • UNSW
      – Specialists in Speech Science, Music, Emotion in speech and music, and
      in Forensic Speech Science. User testing and feedback (Epps, Ambikairajah, Cabrera, Emery).

    • UWA – Specialists in Robust Speech, Speaker
      Recognition, and 3D audio-visual speech and speaker recognition. Adaptation and
      testing of the modifications to HTK (Togneri); Visual and 3D processing for
      recognition (Bennamoun); User
      testing and feedback, audio-visual feature
      processing advice (Togneri, Bennamoun).

    • U
      Canberra – Specialists in AV speech, Automatic Speech Recognition,
      Forensics. Adaptation of AVOZES corpus and of the DeMoLib liptracker tool
      (Goecke); User testing and feedback (Wagner, Goecke).

    • U Tasmania
      – Psycholinguistics and reading studies. User testing and feedback

    • RMIT
      – Information Retrieval and Natural Language Processing (Sanderson,

    • UNE – Psycholinguistics,
      regional languages, logic in child language. User testing and feedback (Khlentzos, and the Language and Cognition Research Centre).

    • LaTrobe – Language diversity, minority
      languages, data-oriented and theory-oriented approaches. Integration
      post-project of the VisLab for the remote use of scientific instruments
      and imaging of scientific data (Schembri); User testing and feedback (Tabain).

    • NICTA
      – Specialists in Machine
      Learning and Control and Signal Processing, including Language Technology and
  Text Retrieval. Support for interoperability with UIMA, the Unstructured Information Management Architecture, an
  emerging standard for wrapping components for processing language, speech,
  video, and other unstructured data.

    • AusNC
      Inc. – The Australian National Corpus – Adaptation
      of AusNC corpora and tools for HCS vLab; contribution of expertise on the licensing of corpora for online use and the
      core technical platform developed to ingest corpus data and meta-data into a
      unified online format (Haugh, Cassidy, Goddard).

    • ASSTA
      – the Australasian Speech Science and Technology Association – Peak
      body on speech science research in Australasia. Promotion of the HCS vLab
      through electronic bulletins, the ASSTA Newsletter and the biennial Speech
      Science and Technology (SST) conference; advice from speech science and technology
  experts as required; liaison and HCS vLab promotion with its international
  counterpart, the International Speech Communication Association (ISCA).

    Key Personnel

• Project Leader, Professor Denis Burnham, Inaugural Director (1999- ) of Marcs
  Institute, UWS, conducts research in behavioural and speech science with collaborators from music cognition,
  linguistics, phonetics, engineering, computer science and creative arts. He is
  President, Australasian Speech Science & Technology Association (ASSTA,
  2002- ); Member, ISCA (International Speech Communication Association)
  International Advisory Council and Interspeech Steering Committee; and
  Co-Founder, Auditory-Visual Speech Perception Association. Burnham has held over 30
  externally-funded grants (over 20 as Leader): he has led ARC Linkage projects
  with industry partners the Australian Caption Centre and Cochlear Ltd,
  and large interdisciplinary projects, e.g., the $2M 5-year ARC Research Network
  on Human Communication Science (HCSNet) (Triumvirate CI); and Leader of the
  $3.4M 5-year ARC and NH&MRC Special Initiative Thinking Head, the ARC $1M
  Big Australian Speech Corpus (Big ASC), and most recently Seeds of Literacy, a
  5-year $750K ARC Discovery.

• Project Manager, Dr Dominique Estival has a PhD in Linguistics and extensive experience
  in academic research and commercial project management for Language Processing.
      in academic research and commercial project management for Language Processing.
      Following research, industry and academic positions in the USA, Europe and
      Australia, she took up Project Management roles: Team Leader, R&D, Syrinx
      Speech Systems, a Sydney speech recognition company developing automated
      telephone dialogue systems; Senior Research Scientist, Natural Language technologies,
      human-computer interfaces and multi-lingual processing with the DSTO (Defence
      Science & Technology Organisation); and Senior Manager, Projects and
      Research, managing language processing research for US-government-funded and
      commercial projects at Appen P/L, a company providing speech and language
      databases for language applications. Estival is currently Project Manager, the
      Big ASC (Australian Speech Corpus) at UWS, where she has managed rollout of
      software and hardware to 17 Australian sites for AV recording of the AusTalk
      corpus. Estival is a founding member of the
      Australasian Language Technology Association (ALTA) and in 2008
      established the Australian Computation and Linguistic Olympiad (OzCLO). Dr Estival has a wealth of experience in academia and
      industry, including project management of large collaborative projects and
      will, therefore, be employed at the Manager level.

    • Product Owner,
      Associate Professor Steve Cassidy, Macquarie University is a Computer Scientist whose
      research covers the creation, management and exploitation of language
      resources. Cassidy has a PhD in Cognitive Science and has worked in both
      Linguistics and Computer Science departments. He is the main author of the Emu
      speech database system which is widely used in the creation and analysis of
      spoken language data for acoustic phonetics research. He has been involved in
      the standardisation of tools and formats for the exchange of language resources
      starting with his work on the Emu system and more recently as an invited expert
      on the ISO TC 37 working groups on annotation interchange formats and query
      languages for linguistic data. He has been instrumental in establishing the
      Australian National Corpus as an umbrella organisation to manage language
      resources in Australia and is an active collaborator with similar projects in
      the US and Europe.  Cassidy will act
      as Product Owner for this project and as such will act as a conduit between the
      development team and the prospective users around Australia as well as ensuring
      that the product is interoperable with related international efforts.

    • eResearch Analyst, Peter Bugeia (Intersect, UWS) is the eResearch Analyst for the
  University of Western Sydney. He has 27 years' IT experience across a wide range
      of industries including medicine, banking, finance and media. He has worked in
      commercial, not-for-profit and public sectors and has held various roles from
      Senior Software Engineer and Test Manager to Project Manager, Enterprise
      Architect and Business Analyst.

• Intersect Project Manager. Georgina Edwards is Intersect's technical development
  manager assigned to manage the software development aspects of the
  project. She has over 10 years' experience in commercial software design
  and development, and has worked in banking and finance as well as eResearch. Edwards
  has a BE (Hons) in IT & Telecommunications from the University of Adelaide.
  She has experience building web applications in a range of languages, and is
  also an experienced agile practitioner. Edwards will work closely with the
  Project Manager and Product Owner to ensure proper and effective integration
  between the stakeholder community and the software engineering team. Edwards
  will also work with Intersect's management team, comprising Dr Ian Gibson (CEO),
  Rodney Harrison (Engineering Manager), Dr Joe Thurbon (Services) and Shane Youl
  (IT Manager), to ensure that the development and
  operations of the project are staffed appropriately and executed efficiently.


    The HCS
    vLab will constitute a collaborative environment for access and analysis of human
    communications data
    by HCS tools. It will provide resources to create new
    annotations for existing data, and a space for researchers to store new data
    and tools for use by the research communities. The overall structure of the
    environment is shown in Figure 1. The HCS vLab is designed to make use of
    national infrastructure – including data storage, discovery and research
    computing services. It incorporates existing eResearch tools adapted to work on
    shared infrastructure, and orchestrated by a workflow engine with both web and
    command line interfaces.

    Estival will be Project Manager and will be working with A/Prof Steve Cassidy,
    the Product Owner. The Project Manager will be part of the Steering Committee
    and have oversight of the development undertaken at both Intersect and

    Functional Overview

    The HCS
    Virtual Laboratory will provide researchers with an integrated environment in
    which to select and perform analysis on Corpora through a suite of
    pre-installed tools. The HCS vLab will be designed for use by both IT-technical
    and non-IT-technical researchers, with user interaction through a Web
interface. General functionalities of the HCS vLab are as follows (an
illustrative sketch of this user flow appears after the lists below):

1. Users can browse lists of corpora containing Human Communications data
  and of pre-installed HCS Tools,
      including corpora from the already-funded NeCTAR Virtual Lab, HuNI (the
      Humanities Networked Infrastructure) (see Tools and Corpora below, and also B.18).

    2. Users can select either a single corpus for search and analysis or
      several corpora to perform a federated search. Some users will also be able to select
      and add their own data sets for search and analysis.

    3. Users then select one or more tools by which to analyse the selected
      data. The system will display runtime options for the selected tools. Users
      will then be able to choose and save the most appropriate options for their
      analysis. The system will ensure only valid options are selectable.

    4. Once one or more corpora or data sets have been selected and tool
      options have been chosen, the user can invoke execution of the tools and the vLab
      will run the tools in their execution environment. Some tools may be configured
      to run in multiple execution environments, and a special HPC execution
      environment will be available for compute-intensive tools (Intersect has
      provided an initial in-kind allocation of HPC computing time for the HCS vLab).
      During execution, vLab will copy and/or make available data and files to the
      selected tool transparently to the user.

    5. The user will be able to monitor and control execution as it proceeds,
      and terminate if necessary; the user will be able to request a change in the
      computing resources assigned to an executing tool.

    6. Once execution is complete, the user will be able to view the results
      through the Web interface.

    7. Tools will either automatically add results to the Annotation and
      Record-Level Metadata Stores and/or the user will make these updates manually
      through an Annotation Service. Annotations and metadata will be private to the
  originating researcher until the researcher chooses to make them publicly available.

    8. The UWS Research Data Catalogue, managed by the UWS library (established
      with the help of ANDS Metadata Stores and Seeding the Commons programmes) will
      play a key role in describing and disseminating descriptions of corpora, tools
      (as services) and annotations (new data sets).


    • A command line interface
      will also be available for users.

    • For Desktop Tools, the user
      will be able to interact with the application once it has been initiated.

    • The user will be able to
      chain the execution of Tools together, and capture and share these workflows.

    • As some of the Corpora data
      are sensitive, Corpora will only be available to researchers who are
      appropriately authorised for that particular corpus. This will be checked
      through the “User Registration” service.

    • Users will be able to
      request the addition of Tools and Corpora to the vLab, and to store their own
      private data, subject to the level of commitment to the Virtual Lab from their
      organization as described in the Partner Model, with appropriate service
      support and regulation for authority.
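As a purely illustrative aid to the functional description above, the following Python sketch models the intended user flow (browse corpora, select a tool, choose only valid runtime options, run, and view results) with a toy in-memory catalogue. The corpus names, tool names and option names are invented for illustration and do not describe actual HCS vLab interfaces or data.

```python
# Illustrative sketch only: a toy, in-memory model of the user flow described
# above. Corpus names, tool names and options are invented for illustration.

from typing import Dict, List

CORPORA: Dict[str, List[str]] = {
    "example-speech-corpus": ["recording_001.wav", "recording_002.wav"],
    "example-text-corpus": ["document_01.txt"],
}

TOOLS: Dict[str, Dict] = {
    # Each tool declares the runtime options it accepts and their valid values
    # (cf. step 3: the system ensures only valid options are selectable).
    "word_count": {"options": {"case_sensitive": [True, False]}},
    "spectrogram": {"options": {"window_ms": [5, 10, 25]}},
}


def validate_options(tool_name: str, chosen: Dict) -> Dict:
    """Reject any option or value the selected tool does not declare."""
    allowed = TOOLS[tool_name]["options"]
    for key, value in chosen.items():
        if key not in allowed or value not in allowed[key]:
            raise ValueError(f"invalid option {key}={value!r} for {tool_name}")
    return chosen


def run_tool(tool_name: str, corpus: str, options: Dict) -> List[str]:
    """Stand-in for invoking the tool in its execution environment (cf. step 4)."""
    return [f"{tool_name}({item}, {options})" for item in CORPORA[corpus]]


if __name__ == "__main__":
    print("Available corpora:", list(CORPORA))                        # step 1: browse
    opts = validate_options("word_count", {"case_sensitive": False})  # step 3: options
    results = run_tool("word_count", "example-text-corpus", opts)     # step 4: execute
    print(results)                                                     # step 6: view results
```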

    Technical Overview

    The HCS
    vLab, in particular the Workflow Engine, will be built around the Galaxy open
    source workflow management system (see also B.11). The proposed Technical
    Architecture of the HCS vLab is as follows:

    • An instance of the Workflow
  engine will be run on a virtual machine in the Intersect, NeCTAR, or other
  research cloud.

    • The Standard Execution
      Environments will be pre-configured virtual machines that have one or more
      tools installed. The HPC Execution Environment is likely to be a standard
      operating environment with no virtualisation.

    • The Tool & Corpora
      Definitions, Record-Level Metadata Store and Annotation Store will be
      cloud-based databases which are available to all virtual machines that form
      part of the vLab.

    • The Corpora will live on
      the Intersect RDSI Node, or some other RDSI node, closely located to HPC
      compute resources.

    • Within an execution
      environment, tools will be run in either the foreground or in batch mode,
      depending on what is appropriate for the tool and the environment.

    [Figure: HCS vLab technical architecture diagram (VLab Diagram v4.jpg)]
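
    As a rough illustration of what the Tool & Corpora Definitions store might hold, the sketch below shows one possible shape for a tool-definition record; all field names and values are assumptions for illustration only and are not prescriptive.

        # Illustrative assumption only: one possible shape for a record in the
        # Tool & Corpora Definitions store.
        emu_tool_definition = {
            "name": "EMU",
            "description": "Search, speech analysis and interactive labelling "
                           "of spectrograms and waveforms",
            "execution_environments": ["standard-vm", "hpc"],  # where the tool may run
            "invocation": {
                "mode": "batch",  # foreground or batch, depending on the environment
                "command_template": "emu-query {corpus} {query} --out {output_dir}",
            },
            "options": {  # runtime options exposed to the user in the web interface
                "segment_level": ["phonetic", "word"],
                "output_format": ["csv", "emu-db"],
            },
            "input_types": ["audio/wav", "annotation/emu"],
            "output_types": ["text/csv"],
        }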

    Tools and Corpora

    The HCS vLab will integrate existing corpora that house human
    communications data,
    consisting of
    language and music data, in the three most common modes in which these are
    represented – audio, auditory-visual and text. The corpora to be made
    available in this project are all corpora which our participant members have
    either established or are caretakers for, so we have direct access to these.
    They are presented below in the order in which they will be incorporated into
    the framework (see also B.13) and further details regarding platforms, UIs,
    input and output etc. are provided in Appendix B. The order of incorporation of
    these corpora into the framework is determined jointly by the maturity of each
    corpus and the amount of work required to adapt it for incorporation into the
    HCS vLab.

    1. PARADISEC (the Pacific and Regional Archive for Digital Sources in
      Endangered Cultures, including Indigenous languages, music and speech) (5.1TB);

    2. AusTalk AV speech corpus from the BigASC project (7TB);

    3. the Australian National
      Corpus (incorporating the Australian Corpus
      of English (ACE), Australian Radio Talkback (ART), AustLit, Braided Channels,
      Corpus of Oz Early English (COOEE), Email Australia, Griffith Corpus of Spoken
      English (GCSAusE), the International Corpus of English (the Australian contribution is
      ICE-AUS), the Mitchell & Delbridge corpus, and the Monash Corpus of Spoken
      English) (5TB);

    4. AVOZES visual speech corpus (15GB);  

    5. Australian Music Centre archive (extremely large collection of sound and text: over 30,000 items by 530 artists);

    6. Colloquial Jakartan Indonesian corpus (audio and text 32.5TB);

    7. The ClueWeb09
      dataset (100TB).

    The HCS vLab will also integrate existing HCS tools for the analysis of
    music, speech and written text and make them accessible to non-technical
    researchers, while maintaining command-line functionality for more
    sophisticated analyses. Nine of these tools have been developed by our
    participant members. They are listed below in the order in which they will be
    incorporated, determined jointly by the maturity of each tool, the amount of
    work required to adapt it for incorporation into the HCS vLab, and the order in
    which the corpora will be incorporated (see B.13).

    1. EOPAS (PARADISEC tool) for interlinear text and media analysis;

    2. NLTK (Natural Language Toolkit) for text analytics with
      linguistic data;

    3. EMU for search, speech analysis,
      interactive labelling of spectrograms and waveforms;

    4. AusNC Tools: KWIC, Concordance, Word Count,
      statistical summary and statistical analysis on a user-defined subset of the data;

    5. Johnson-Charniak parsers, to generate full parse
      trees for text sentences;

    6. ParseEval, tool to evaluate the syllabic parse
      of consonant clusters;

    7. HTK modifications – a patch to HTK (the Hidden Markov Model Toolkit) to enable missing-data recognition;

    8. DeMoLib software for video analysis; and

    9. PsySound3 (physical and psychoacoustical
      algorithms) for the analysis of complex visual and auditory scenes.

    10. ParGram (grammar for Indonesian).

    11. The INDRI tool for information retrieval with
      large data sets.

    For each of the corpora
    and tools listed above, there is an HCS expert who will work in-kind for 3 weeks
    (15 effort days) with the Project Manager, the Product Owner and Intersect to
    incorporate these corpora and tools into the HCS vLab (see B.13).

    Collaboration with HuNI (an already funded NeCTAR Virtual Laboratory)

    The Virtual Lab will query the HuNI virtual
    lab using a protocol to be negotiated (e.g. OAI-PMH or Atom) for information
    about corpora known to HuNI which may be of use to VL users. Appropriate
    corpora with sufficient metadata can then be loaded into the HCS vLab and used
    with the suite of HCS vLab Tools. Resulting data will be stored in the HCS vLab Annotation
    store, and the existence of the new data advertised back to HuNI via an
    appropriate mechanism. Metadata exchange will use appropriate
    standards, such as OAI-PMH with discipline- and corpus-appropriate metadata
    schemas (EAC-CPF for information about parties, OLAC for linguistics
    resources, MARC for bibliographic description).
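
    By way of illustration, a periodic OAI-PMH harvest of HuNI collection descriptions might look like the sketch below; the endpoint URL and metadata prefix are placeholders, and the actual protocol and schemas will be negotiated as described above.

        # Illustrative sketch of an OAI-PMH ListRecords harvest; the endpoint URL
        # and metadataPrefix are placeholders pending negotiation with HuNI.
        import urllib.request
        import xml.etree.ElementTree as ET

        OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
        endpoint = "https://huni.example.org/oai"        # placeholder endpoint
        params = "verb=ListRecords&metadataPrefix=rif"   # e.g. RIF-CS, OLAC or EAC-CPF

        with urllib.request.urlopen(f"{endpoint}?{params}") as response:
            tree = ET.parse(response)

        # Collect identifiers of records describing corpora of potential interest
        for record in tree.iter(f"{OAI_NS}record"):
            header = record.find(f"{OAI_NS}header")
            print(header.findtext(f"{OAI_NS}identifier"))  # candidate collections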

    Figure 2 shows a potential workflow where an
    HCS vLab user is able to acquire data from HuNI, transcribe it, and publish it
    so that a HuNI user can access the original HuNI object along with the
    transcript using one of the HCS web-based tools, giving them access to new data
    that has been created in the HCS vLab environment. Similarly, HuNI users will
    be able to access key HCS vLab tools, such as time-aligned transcription tools,
    which save data in standard reusable formats, rather than using ad hoc
    solutions as is currently often the case.

    Figure 2: A potential scenario involving interoperable metadata, tool and data
    exchange between the HCS vLab and the HuNI virtual laboratory


    Target Research Community

    The primary target research community to use the HCS vLab is
    encompassed by HCSNet (see B.2), an Australian community with >1200 active members
    from Speech, Language, Music
    and Sonics areas and around 45,000 international members. More specifically,
    the international music community that wants to have the AMC not only more easily
    available but open to different kinds of searches is estimated at around
    10,000. The research communities that would benefit from Text and Speech
    corpora and associated tools can be estimated at 20,000 (linguists, speech
    scientists, behavioural scientists and language technologists). For video and
    visual analysis, the international community is estimated at around 15,000
    researchers and is steadily growing.

    The Australian and international HCS community
    also intersects and overlaps with various professional bodies as set out below
    and, via these, with their international counterparts.

    • The Australasian Speech Science and Technology Association
      (ASSTA, est. 1972) is the peak
      body for speech scientists in Australia and New Zealand and is a cash
      contributor to this bid. ASSTA has various Corporate Members (Appen Butler
      Hill, Cochlear Pty Ltd., the HEARing CRC, Spectral Dynamics) and 110 members
      from the disciplines of engineering, computer science, cognitive science, psycholinguistics,
      language technology, phonetics, linguistics, forensic speech science, and
      speech pathology. ASSTA runs the biennial Speech
      Science & Technology conference, provides HDR student travel assistance
      for international conference presentations, and funds speech
      science and technology research events and initiatives. ASSTA has close ties to
      ISCA, the International Speech Communication Association, which attracts >
      1000 registrants to its annual conference, Interspeech. ASSTA members make up
      the core set of researchers working in the various manifestations of speech
      science in Australasia. Almost all the members were active participants in
      HCSNet, and ASSTA members are at the forefront of grant-getting, publication,
      and PhD supervision in speech science in Australasia.

    • The Australian Linguistics Society (ALS, est. 1967), with more than 450 members, is the peak organisation for
      linguists in Australia. It runs the annual ALS conference and a biennial
      Australian Linguistic Institute (ALI). Many ALS members participated in HCSNet
      activities. The main international counterpart is the LSA (Linguistic Society
      of America) with around 4,500 members, and there are individual Linguistics
      organisations in many countries around the world, with an estimated combined
      total of 10,000.

    • The Australasian Language Technology Association (ALTA, est. 2002), with around
      240 members, promotes research in Computational Linguistics and Natural Language
      Processing. It is a founding regional organisation of the Asian Federation of
      Natural Language Processing, runs the annual ALTA workshops,
      and manages funds for OzCLO (the Australian Computational and Linguistics
      Olympiad). Its main international counterpart is the ACL (Association for
      Computational Linguistics), with around 5000 members and there are more
      specialised organisations, such as AMTA (Association for Machine
      Translation in the Americas), EAMT (European Association for MT), AFNLP (Asian
      Federation of Natural Language Processing), with an estimated combined total of
      10,000 researchers around the world. Most ALTA members participated in HCSNet activities.

    • The Australian Music and Psychology Society (AMPS, est.
      1996), with 200 members, is
      a member of the Asia-Pacific Society for the Cognitive Sciences of Music
      (APSCOM) and has links with ESCOM, the European
      Society for the Cognitive Sciences of Music and SMPC
      the (US) Society for Music Perception and Cognition.
      AMPS runs
      workshops, seminars and conferences, and provides HDR student travel assistance
      for international presentations, e.g., at the annual International Conference
      on Music Perception and Cognition which AMPS hosted in Sydney in 2002. Many
      AMPS members participated in HCSNet activities.

    Needs and Impact


    The HCS research community has produced a number of corpora and
    repositories for human communications data and many HCS tools to
    manipulate, process and analyse these data. But the use of and access to these
    data is hampered by two constraints.

    First, as in so many other disciplines, the amount of data available is
    growing, making it increasingly infeasible for any one researcher or research
    laboratory to maintain up-to-date locally stored datasets. Even where the
    storage capacity to do this exists at a given institution, the multi-way
    replication that local copying strategies encourage only introduces
    version-inconsistency problems that outweigh the advantages of redundancy. Cloud-based
    storage, with appropriate backup procedures, is widely accepted as the best way
    forward, ensuring that all users see the same data. Similarly, cloud-based
    analytical tools relieve individual researchers and sites of the need to
    maintain up-to-date versions of software, outsourcing software infrastructure
    tasks that are not, and should not be, core business for a researcher.

    The second problem is unique to interdisciplinary
    research. It is already well-recognized that a major impediment to
    interdisciplinary interaction is the fact that we ‘speak different languages’: two
    researchers aiming to cross a disciplinary divide need to take the time to
    understand how each uses terminology in often subtly different ways. A related
    phenomenon is present where ‘the rubber hits the road’ in terms of actual
    software and data usage: although a researcher in one discipline may have much
    to gain, and much to offer, from the use of tools and corpora developed in
    another discipline, all too often the hurdles to making this a reality are
    immense, posing a learning curve that has characteristics not dissimilar to the
    culture shock one faces when moving to a new country. Cloud-based tools and
    support cannot erase the difficulties here, but they provide an opportunity for
    interfacing and modularising that make it easier to overcome them.

    A further key concern that our proposal addresses is that of replicability. At a time when conventional science is subject to
    disruptive forces—debates over open access models, attacks on the peer
    review process, and a sense of public distrust—this element of the
    scientific process remains indisputably indispensable. But replicability has
    always been hard, and gets harder as experiments require ever-more technically
    sophisticated tools, and make use of ever-larger data sets. This problem has
    been recognized in Big Science, whether dealing with web click histories,
    consumer purchasing patterns or astronomical data. But it is all too easy to
    overlook the importance of replicability in sciences, like Human Communication
    Science, that rely on ‘mid-size’ data sets. Our HCS vLab proposal addresses
    this problem head-on, by providing a cloud-based platform that supports
    user-defined experimental workflows using standardised public-domain data sets
    and cloud-friendly revisions of existing desktop analysis tools. In this regard,
    our aim is to develop and provide a world-leading best-practice platform for
    scientific replicability in the human communication sciences.

    To address these issues
    we need a connected repository for
    data and integrated access to tools. The HCS community will benefit from (i)
    easy access to data collected in other disciplines, e.g., speech and video
    researchers accessing the indigenous language PARADISEC data, linguists
    accessing the AusTalk transcripts, and the speech/language components of music
    in the Australian Music Corpus; and (ii) the use of tools developed in
    neighbouring disciplines to process and manipulate their own data, e.g.,
    linguistic tools to process the language components of the AMC, visual analysis
    and lip-tracking tools to analyse the video components of AusTalk, and musical
    and acoustic analysis tools to examine rhythm and melody in speech corpora.

    The potential impact of making these
    tools and resources easily available to the vastly distributed Australian
    research community is vast. This is perhaps most obvious where the resources
    and tools have a direct connection to commercially-valuable technological
    developments, as in the areas of speech and language technology, and music
    processing software. Australia’s small size means that we will always struggle
    to compete against the major players in the US and Europe, and increasingly in
    Asia, but HCS vLab will enable a pooling of resources, data and tools that will
    encourage a higher degree of collaboration amongst our researchers, and allow
    us to do more with less.

    Less obvious are the more niche areas of research that are all too often
    ignored in the push for short-term technological wins. Here, HCS vLab provides
    an opportunity to enfranchise and strengthen more isolated research activities.
    For example:  

    • Thieberger
      (U Melbourne) has built a corpus of recordings and time-aligned transcripts of South Efate, a language
      from Vanuatu. This rich set of material could be accessed by others, but
      currently there is no platform to make it available. It requires streaming
      media linked to texts with an annotation module to allow researchers interested
      in prosody, narrative structure or musicology to access and annotate the
      material, safe in the knowledge that both the primary material and the new
      annotations will all have persistent locations.

    • PARADISEC has
      many legacy audio speech and music recordings that are not annotated beyond a
      gross one-line description per audio-tape. In some cases it is not even clear
      what language is represented. Annotation would be greatly facilitated by
      crowd-sourcing; HCS vLab would provide an online space with an easy-to-use GUI
      where native language speakers and other researchers could access and annotate
      the recordings.

    • There is a wealth of Aboriginal language dictionary and text material
      that once formed the Aboriginal Studies Electronic Data Archive (ASEDA), a digital
      archive collected between the late 1980s and early 2009 by the Australian
      Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS). The
      establishment of the HCS vLab would make possible collaborative work with
      AIATSIS to make the ASEDA digital archive (also available as AUSTLANG) more visible.

    • There is an increasing awareness that the temporal dimension of speech
      encodes a substantial amount of information about linguistic structure.
      Extracting this information requires experimental methods for acquiring high
      temporal resolution speech movement data, software for data analysis, and
      computational modelling tools for linking the data to linguistic structure.
      At present, however, there is no accessible platform through which such
      analyses can be performed, no standards for analysis have been established,
      and many of the analytical tools remain inaccessible to the broader research
      community. HCS vLab will provide access to Shaw’s ParseEval, a tool that
      allows syllable structure to be ‘read’ from speech data. Other tools, such as
      DeMoLib, will provide analysis of audio-visual speech.

    Much has been written and said about the imperative
    to make research data widely available, especially when its creation is
    publicly funded; but the commitments signed up to in ARC proposals are more
    honoured in the breach than in the observance. This is in no small part due to
    the difficulty of the mechanics of making data available. HCS vLab will provide
    a platform for researchers in the human communication sciences that overcomes
    this problem, allowing increased leverage of existing investments and, in the
    process, making the data accessible to a wide range of existing tools in a
    streamlined fashion. For instance, with regard to the AusNC, HCS vLab will
    provide web-based tools for the display and analysis of annotations on
    linguistic material; natural language processing tools for text; automatic
    tools to support the generation and classification of annotations from audio
    and textual material; and tools for transcoding and streaming of audio and
    video in HTML5-ready format; and HCS
    vLab will make available the Ethnographic E-Research Online Presentation
    System to
    present interlinear glossed text and media for other corpora. In addition, with the assistance of ongoing
    infrastructure support from UWS after the project is completed and HCS vLab is
    established, there are new tools that could be incorporated into HCS vLab, e.g.,
    with the Australian Music Centre, an IR (information retrieval) tool for
    music and other acoustic data; and musical scores
    in XML form for musical academic research.


    Australian HCS community responded positively to the formation of HCSNet by
    attending the 60 HCSNet workshops and seminars, by an increase in successful
    large grants in the HCS area (see B.2) and by significant high impact journal
    publications (see the top 30 papers to come out of HCSNet in Dale, R., Burnham,
    D. & Stevens, C.J. (2011) Human
    Communication Science: A Compendium
    ). Through HCSNet, the ARC has already
    invested in the development of a strong interdisciplinary community that has
    been widely recognized overseas; HCS vLab is an opportunity to both build upon,
    and reach far beyond HCSNet to provide that community with the tools and
    resources that it needs to leverage our distributed research capacity. The user
    base is ready and waiting. The impact of the HCS vLab on HCS research in
    Australia will be significant, far-reaching and sustainable. It will take HCS research
    capability to the next level, beyond the individual modalities of speech,
    language and music, and above that which can be accomplished on the ground in
    individual centres and institutes.

    HCSNet provided
    the opportunity for applying novel combinations of old ideas or methods of
    analysis from different disciplines to new problems. For instance, in the HCS
    Compendium, Butavicius and Lee (2011) describe a multi-disciplinary approach
    involving visual perception, human-computer interaction and cognitive modelling
    to the problem of assisting users to find relevant information in very large
    data sets; and Copland et al. (2011) married an existing psycholinguistic
    semantic priming task with fMRI to address the role of dopamine on
    neurotransmitters in semantic processing. However, the HCSNet experience also
    made it abundantly clear that one of the main impediments to such quantum leaps
    in HCS research was the difficulty for a researcher from one discipline to
    apply the tools and techniques of another discipline, or to explore data
    collected under one paradigm via a completely different analytical perspective.
    As HCSNet proved to be a model for inter-disciplinary collaboration, the
    HCS vLab will be an exemplar for other research communities.

    The impact and benefits of the HCS vLab activity will be tracked via the UWS
    Service Desk. The following measures will be used to report on utilisation and
    uptake by the research community, measured through usage counts (number of
    users logged in, number of tools used, number of queries) and user surveys
    distributed and collected at each of the 3 vLab phases (see B.13).

    • Project / developmental measures:

      • number of
        functional tests performed by researchers

      • number of
        researchers who have participated in requirements gathering and testing

    • Production measures:

      • number of
        researchers with login accounts

      • number of actual
        researcher logins to the vLab

      • transaction
        counts: e.g. searches performed, invocation of tools,  annotations generated and committed to
        the annotation store.

    Future Developments

    Once the tools described in
    section B.7 have been incorporated, other well-known public access tools, such
    as ELAN (professional tool for the
    creation of complex annotations on video and audio resources); CLAN (Computerized Language
    ANalysis, for Conversational Analysis)
    ; PRAAT (scientific software program for the analysis of
    speech in phonetics); and
    The Field Linguist’s Toolbox (data management and
    analysis tool for field linguists), will be considered for addition to the HCS
    vLab. See Section B.18 for a list of projects which will be interoperable with
    the HCS vLab and for a list of additional corpora and tools which project
    partners already plan to add.

    Project partners have
    already indicated their intention to add the following:

    • U. Sydney

    • ExSite9 – a tool for
      efficient adding of metadata for digital research data as it is collected in
      the field.

    • NABU – a catalog system for
      research collections with the ability to provide streaming access to media
      objects as well as providing a management environment.

    These tools could work together
    with EOPAS: 1) ExSite9 for assembling and linking data prior to submission to
    repository; 2) NABU for managing the data repository and providing access to it
    where possible; 3) EOPAS for providing fine-grained online access to annotated
    media in the repository.

    • U. Canberra:

    • UCBN – a broadcast news
      database with audio-visual video sequences (both text dependent and text
      independent subsets). There are around 20 speakers and the total footage is
      around 6 hours of recording (about 100 MB).

    • MultimodalFusionLibrary – a
      software Library in C# for audio visual processing with several pre-processing,
      feature extraction, learning and classification algorithms mainly for
      stereoscopic and 3D video sequences.

    • UWS:

    • The AV face cover corpus through our MOU with the BBfor2 (Bayesian Biometrics for
      Forensics) Euro-funded project (BBfor2 is funded by the EC as a Marie-Curie ITN-project
      (FP7-PEOPLE-ITN-2008) under Grant Agreement number 238803.)

    • The ABC corpus of infant-directed speech.

    B.10    Broader

    The flexible HCS vLab framework
    coupled with the ongoing support by the UWS eResearch Unit will facilitate the
    inclusion of future corpora and tools. For example:

    • A new (awarded 2012) Discovery
      project (Best, Shaw, with PIs Hay, Foulkes, Docherty, Evans: ‘You came here TO DIE?!’) will collect audio recordings of
      a carefully-constructed corpus of words and phrases in 5 English regional
      accents (AusE, NZEng, Cockney, York and Newcastle-upon-Tyne). That corpus, plus
      earlier sets of Jamaican and American English materials collected in a current
      ARC Discovery project at UWS ‘How strict is the Mother Tongue?’ will be added
      as a database in HCS vLab. The pending project includes plans to run
      computational modelling on that corpus to determine relationships among
      pronunciations of words in the 5 accents, and to relate those modelled
      parameters to real-human word recognition across accents.

    • A current ARC Discovery
      project at UWS (DP110105123,
      2011-2015, The Seeds of Literacy) will generate a corpus of caretaker speech to
      100 children at 3-monthly intervals from when they are 6-month-old infants to
      5-year-old children and this will be added as a database in HCS vLab.

    • A recently awarded (2012)
      ARC LIEF grant ‘A Living Archive of Australian Indigenous Languages’ through ANU is designed to
      digitise vernacular literature from NT Aboriginal schools, and the HCS vLab
      would provide a suitable home for these data.

    • At ANU there are plans to augment a current fledgling corpus of Spanish
      with new data in order to conduct comparative research. Again the HCS vLab
      would provide a suitable home for these data.

    In addition, a number of international
    labs across the world will welcome HCS vLab, for example:

    • The New Zealand Institute for Language, Brain and Behaviour (NZILBB), a
      close partner of Marcs Institute, has both databases and tools they have
      developed for recordings of NZ English, NZ Maori and Maori English, which
      could be included as part of HCS vLab in the future.

      • The University of Southern California
        is currently developing Electromagnetic Articulograph and real-time dynamic
        Magnetic Resonance Imaging speech databases and tools, and members of that
        project, Shri Narayanan and Mike Proctor are some of the many international
        collaborators of Australian HCS researchers.

      • The recent
        NSF grant for ‘The Language Acquisition Grid: A Framework for rapid Adaptation
        and Reuse’ project led by Nancy Ide of Vassar College will provide a grid-based development environment for
        building natural language processing pipelines, facilitating application
        building and experimentation. As an international collaborator, the HCS
        vLab Product Owner, Steve Cassidy, will ensure that
        the Language Application Grid is compatible with HCS vLab infrastructure
        through the use of shared data standards and application interfaces.

    B.11    Value

    ANDS – Australian Research Data Commons

    HCS vLab will be integrated with the Australian Research Data Commons. The
    project will manage collections of research data, and enable the selective
    export of descriptions of the data collections via OAI-PMH.

    AAF – Authentication

    HCS vLab will provide authentication against the AAF through the SAML2
    protocol. This will allow appropriately authorized individuals to authenticate
    using their institutional credentials. Where that is inadequate, dual
    authentication systems will be used, as outlined in the Federation’s technical
    documentation. Where possible, international collaborators will be
    authenticated in the system using the existing reciprocal arrangements between
    the AAF and its international partner organizations.

    NCI – National Computational Infrastructure

    HCS vLab will integrate with NCI by submitting jobs to the HPC system.
    Authentication will be based on the credentials of the user logged in to the
    proposed system. SSH will be used as the protocol. Access to the HPC system
    assumes that the user of the proposed system has CPU time allocated to them on
    the HPC system, for example through the national Merit Allocation Scheme.
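
    As a rough sketch of this integration mechanism, a compute-intensive tool run could be submitted over SSH along the following lines; the host name, project code and scheduler directives are placeholders and assume a PBS-style batch scheduler of the kind used on NCI systems.

        # Illustrative sketch only: submitting a tool run to an HPC system over SSH.
        # The host name, project code and PBS directives are placeholders.
        import subprocess
        import textwrap

        job_script = textwrap.dedent("""\
            #!/bin/bash
            #PBS -P placeholder_project_code
            #PBS -l walltime=01:00:00
            #PBS -l ncpus=16
            cd $PBS_O_WORKDIR
            ./run_tool --corpus austalk --options options.json
            """)

        # Submitted under the logged-in user's own credentials, as described above.
        result = subprocess.run(
            ["ssh", "user@hpc.example.edu.au", "qsub"],
            input=job_script, capture_output=True, text=True, check=True,
        )
        print("Submitted job:", result.stdout.strip())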

    RDSI – Research Data Storage Infrastructure

    HCS vLab will host human communications datasets of national significance, as
    described in items 2 and 3 of this section. We plan to host these datasets on
    RDSI on the Intersect node. We anticipate strong alignment between this project
    and the RDSI ReDs programme. We are planning to host this project on the
    proposed Intersect Research Cloud node, which is to be co-located with the
    proposed Intersect RDSI node, and Intersect’s existing infrastructure.

    Research Cloud

    HCS vLab will be built using the OpenStack Cloud API, to allow integration with
    the NeCTAR Research Cloud.
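
    For illustration, provisioning a pre-configured Standard Execution Environment on an OpenStack cloud such as the NeCTAR Research Cloud could look roughly like the sketch below, which calls the OpenStack Compute (“Nova”) REST API directly; the endpoint, token and image/flavour identifiers are placeholders.

        # Illustrative sketch: booting an execution-environment VM via the
        # OpenStack Compute (Nova) REST API. Endpoint, token and IDs are placeholders.
        import json
        import urllib.request

        compute_endpoint = "https://nova.example.org/v2/TENANT_ID"  # placeholder
        auth_token = "PLACEHOLDER_TOKEN"                            # from Keystone

        server_request = {
            "server": {
                "name": "hcsvlab-exec-env-emu",
                "imageRef": "IMAGE_ID",    # image pre-configured with the tool(s)
                "flavorRef": "FLAVOR_ID",  # VM size
            }
        }

        request = urllib.request.Request(
            f"{compute_endpoint}/servers",
            data=json.dumps(server_request).encode("utf-8"),
            headers={"X-Auth-Token": auth_token, "Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            print(json.load(response)["server"]["id"])  # ID of the newly booted VM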


    The HCS
    vLab acknowledges the Galaxy-GDR Integration NeCTAR project which will provide
    designs and insights that will be useful to this project, and vice versa.

    The Australian National Corpus

    The Australian National Corpus was
    funded by the Australian National Data Service (ANDS) to collect together a
    number of existing language corpora on a single technical platform that unified
    the many data and meta-data formats and provided a uniform set of interfaces to
    browsing and searching these data sets. The ANDS project developed an ingestion
    process that could be used with a wide variety of data types including text,
    audio and video collections, bringing them into a single standardised data
    store that is then exposed via a web interface. Another outcome of the project
    is a legal and ethical framework for sharing language resources within the
    research community. The Australian National Corpus platform will provide a
    starting point for the storage of data and meta-data in the HCS vLab. It will
    be extended to allow ingestion of new data types and to support federated
    storage of resources and meta-data.

    The Humanities Networked Infrastructure (HuNI)

    The Humanities Networked Infrastructure (HuNI) is one of the NeCTAR Virtual
    Labs being established. The HCS vLab and HuNI are complementary to each other
    and a compatible feed pipeline will be built
    between the two (see B.18).


    B.12    Governance

    HCS vLab
    Project Organisation

    The authority structure for the HCS vLab project reflects the partnership model as
    described in Appendix E. Two key governance groups will be formed: a Steering Committee and a Stakeholder Group. A number of key
    project roles will interact with these groups, as described below.

    The Steering Committee is a small, senior
    group with overall responsibility for and authority over the project and its
    resources. The Steering Committee is charged with ensuring the project achieves
    its objectives, that stakeholders receive the benefits of the HCS vLab, and
    that these benefits are sustainable into the future. An important aspect is that the
    Steering Committee is responsible for arbitrating between the interests of
    members of the Stakeholder Group when necessary. The Committee consists of:

    • Project
      Director – Professor Denis Burnham, UWS

    • UWS
      eResearch Manager – Dr Peter Sefton, UWS

    • Senior
      Partner Representatives – representatives from the major partners in the project.

    • Occasional
      members: other stakeholder representatives who have been identified as having
      particular expertise across areas of HCS and who will be called upon for
      particular meetings as required.

    • A
      representative from NeCTAR will be invited to attend as an observer.

    The Steering
    Committee will meet monthly. The HCS vLab Project Manager (Dr Dominique
    Estival) and the Intersect Project Manager (Georgina Edwards) will be required
    to attend Steering Committee meetings to report progress and resolve issues. The
    Steering Committee will be provided with expert technical advice by the Product
    Owner and the Intersect Project Manager. The
    Steering Committee is expected to become the Executive User Group once the
    project has completed.


    • The Project Director is the Chair
      of the Steering Committee, and is responsible for overall direction of the
      project and liaising with UWS executive.

    • The HCS vLab Project Manager will be
      responsible for overall project planning, tracking and control, change
      management, stakeholder communication both within and external to the project,
      vLab marketing, user support, and acceptance testing. The PRINCE2 project
      methodology will be used, leveraging templates available through UWS
      Information Technology Services’ Project Management Office. These methods will
      be tailored somewhat to integrate with Intersect’s agile Scrum software
      development method. The vLab Project Manager will work with the Steering
      Committee to establish key success factors, metrics and project quality gates
      for assessing project performance.

    • The Intersect Project Manager will
      be responsible for software development and co-ordinating all requirements
      analysis and software development activities associated with the project.
      Intersect will be responsible for producing the solution design, a working vLab
      application, and for integrating tools and corpora into the vLab. Intersect
      employs an agile approach to software development – see Appendix C for
      more information.

    • The RMIT / NICTA Software Developer will
      be responsible for building software interfaces which enable the vLab to
      interoperate with other applications and vLabs as described in B.7.

    • The Product Owner will be
      the main point of interface between the software engineering team and members
      of the stakeholder working group, as per Intersect’s software methods which are
      described in Appendix C. The Product Owner (see B.6) will be responsible for
      the injection of domain-specific information into the framework, and will work
      closely with the vLab Project Manager to gain input from external stakeholders
      as and when required.

    The Stakeholder Group comprises as wide a collection of representatives of
    end-users as possible to ensure the widest possible engagement. Importantly,
    this group will provide subject matter expertise in defining the functional
    requirements for the vLab, and will participate in user testing. The primary
    stakeholder group has already been identified: the HCS Tool Authors, HCS Corpus
    Caretakers and HCS academic experts from across the country, who will interact
    with the Intersect software
    engineering team through the Product Owner. They will, along with academic
    experts in particular areas of HCS and their doctoral students, test releases
    of the HCS vLab (see Section B.15 for a more detailed description). There are
    currently 47 academics (see Appendix A) who are committed to participating in
    the Stakeholder Group. Further stakeholders will be nominated during the
    elaboration phase, and will include members of the Steering Committee.

    B.13    Project
    Scale, Key Deliverables and Acceptance Criteria        

    Project Scale

    Please see Part D3.2 for more detail. A summary
    is given below:

    Project Duration: 12 elapsed months

    Total contingent effort: 190 effort months

    FTE (during project): [redacted for public version]

    Funding Requested from NeCTAR: [redacted for public version]

    Co-investment (during project): [redacted for public version]

    Additional Co-investment (post-project) to 31/12/2014: [redacted for public version]

    Grand Total: [redacted for public version]


    • The effort estimates have been built according to Intersect standard
      estimation approaches.

    • A standard 20% contingency has been applied to the estimate.

    • Project risk profile is Moderate. Overall mitigation strategy is to
      build the system Design-To-Cost.

    • FTE is a combination of project management personnel, product ownership,
      software developers, testers, researcher developers,
      and user representatives.

    • $423,000 of the co-investment is in cash. This will be used to part-fund
      the Product Owner role, and for additional software development.

    • Co-investment
      is from the following partners: UWS,
      Macquarie, Melbourne, Sydney, ANU, Flinders, RMIT, NICTA, UNSW, UWA, UNE, UC,
      LaTrobe, UTas, ASSTA, AusNC Inc.

    Key Deliverables
    and Acceptance Criteria

    The 7 corpora and the 11 HCS Tools developed by HCS
    members to be made available in the HCS vLab project (see also B.7) will be
    incorporated at different phases (N=3) of the project, in an order determined
    jointly by the maturity of each corpus and the amount of work required to
    adapt it for incorporation into the HCS vLab.

    Table 1: Key Deliverables and Acceptance Criteria





    • Start date – contract signed.

    • Problem Statement complete. Acceptance: signed off by Steering Committee; submitted to NeCTAR.

    • Elaboration deliverables – Project plan; Quality plan; User stories; HCS vLab Architecture. Acceptance: signed off by Steering Committee; submitted to NeCTAR; contracts signed; tested by collaborators and HDRs.

    • VLab V1 – Standard Execution Environment; basic Workflow platform (User Registration, Corpus Browse, Tool Browser, Tool Execution Service); RDSI; Tool and Corpora Data; with 2 Corpora (PARADISEC and AusTalk) and 4 sets of Tools (EOPAS PARADISEC tools, NLTK, EMU, Johnson-Charniak parsers). Acceptance: tested by collaborators and HDRs; browse corpora and read data; use tools on data; signed off by Steering Committee; submitted to NeCTAR.

    • Service Desk operational; impact and benefits tracking in place; Impact and Benefits and Project Status reports. Acceptance: tested by collaborators and HDRs; signed off by Steering Committee; submitted to NeCTAR.

    • VLab V2 – Execution Environment on the NeCTAR Research Cloud; basic UIMA Data Bus; Workflow: Security and Access Control, Annotation Services, Federated Search; UWS Research Data Catalogue; Annotation Store; additional corpora (AusNC, AVOZES, AMC) and sets of tools (AusNC Tools, HTK, DeMoLib, PsySound3, INDRI); Impact and Benefits and Project Status reports. Acceptance: tested by collaborators and HDRs; federated / cross-corpora search; signed off by Steering Committee; submitted to NeCTAR.

    • VLab V3 – HPC Execution Environment; Corpus Management Service, Tool Management Service, Workflow Capture; Metadata Store; access to HuNI vLab Data and Tools; additional tools (ParseEval, NuancesWithMidi, ParGram) and corpora (Jakarta Indonesian, Forensic); Impact and Benefits and Project Status reports. Acceptance: tested by collaborators and HDRs; HuNI Data and Tools; create and reuse workflows; signed off by Steering Committee; submitted to NeCTAR.

    • Post-implementation Review (PIR) conducted; report signed off by Steering Committee and submitted to NeCTAR; Completion Certificate accepted by NeCTAR.

    • Service Levels met and reported to NeCTAR as defined.

    B.14    Staged Implementation


    Milestones (each with a date of deployment for pilot use and a date of deployment as a production service):

    • Sub-contract signed, project started

    • Elaboration Phase Complete

    • HCS vLab Version 1 Operational

    • HCS vLab Version 2 Operational

    • HCS vLab Version 3 Operational

    • Final Admin Closure

    • Application Support and Maintenance, User Group to Dec 2014

    • Application Support and Maintenance, User Group to Dec 2015



    B.15    Project Approach


    A detailed description of Intersect’s project approach is in the attached
    document “Intersect Software Development Process” (Appendix C) and this is a
    summary of that document. Intersect’s approach to running the project rests on
    two principles: 1) Ongoing communication with the governance group, reporting
    to them and soliciting input; and 2) Continual integration of a tested and
    deployed product.

    A project goes through four
    stages: Concept; Elaboration; Development and Deployment. These four stages
    will be spread across the 14 months of the HCS vLab project as set out below. The
    Stakeholders Group and the Steering Committee will be involved in all four
    stages of the project, with the Project Manager and the Product Owner ensuring
    communication between the Steering Committee, the Stakeholder Group and the
    Intersect development team and overseeing the integration and testing of the
    HCS Corpora and Tools. HCS
    end-user testers, drawn from the Stakeholder Group, will be responsible for
    testing the software after functionality has been developed.

    More specifically, for the HCS vLab, the Stakeholder Group will comprise:

    Tool Authors:
    The author or developer of each of the 11 HCS
    tools to be integrated in the HCS vLab will be responsible for adapting their
    particular tool to the HCS vLab environment, with the assistance of the Product
    Owner interacting with the Developer, Intersect.   

    Corpus Caretakers:
    For each of the 7 corpora to be
    incorporated, at least one HCS Corpus Caretaker (for instance AusTalk and
    PARADISEC have 2 and 3 developers respectively) will be responsible for
    adapting their particular corpus to the HCS vLab environment, with the
    assistance of the Product Owner interacting with the Developer, Intersect.

    Sprint Testers:
    The HCS Tool Authors and HCS Corpus Caretakers
    will be joined by another 26 academic experts in particular HCS disciplines, who
    will provide in-kind co-investment of three 2-day sessions of testing the HCS vLab
    functionalities during the sprints in the development phase (see 3 below). These people
    will be deployed when tools and/or corpora of interest in their particular area
    have been incorporated (see Appendix B for a list of the tools and corpora). A
    total of 15 Higher Degree Research (HDR) students from the 15 co-investing universities
    and research institutions (see Appendix A) will be assigned to test the functionalities
    of the HCS vLab during the sprints in the development phase – see 3
    below. To provide continuity, each HDR student will act as tester for the three
    consecutive sprints in a particular development phase (see Figure 1 below) and
    then as a tester in the first sprint of the next development stage.

    1. Concept Stage

    During the concept stage, the development team and the stakeholders come to an
    agreement on a concise definition of the key problem to be solved. This
    statement identifies the nature of the problem, the group whom the problem
    impacts, the impact, and the properties of a solution. The outcome of this
    stage is a problem statement. The nature of the NeCTAR RfP means that proposed
    projects are well into the concept stage by the time they are proposed.

    The key outcome of the concept stage is agreement on the problem statement.

    2. Elaboration Stage

    The elaboration stage is used to bootstrap the development stage. During this
    stage, the artefacts used to monitor and steer the project are created,
    including the initial user stories, a product backlog and a burn-up chart. In
    addition, the team evaluates and settles on the key technology choices for the project.

    The elaboration phase concludes when the team can articulate the key technical
    risks and approaches to removing those risks; and when the team can identify
    the key roles on the project (especially key stakeholders) and key constraints
    of the project (e.g. dates). Key to the agile process is that these decisions
    may change at a later stage.

    By the end of this stage, a project management plan and a quality management plan
    have been developed.

    3. Development Stage

    The development phase consists of two-week sprints where the team works to complete
    the stories in that sprint. Usually planning, execution and review overlap, as
    shown below. The defining characteristic of the development stage is that at
    the end of each sprint, we have a potentially shippable increment of the
    product. If the customer wishes, we can deploy this to production systems for

    Figure 3: Illustration of when planning,
    executing and reviewing of sprints occurs

    Prior to the start of a sprint, the sprint is ‘planned’. A set of stories from the
    backlog is selected and elaborated. This elaboration includes defining detailed
    acceptance criteria: these criteria define when the story is ‘done’ (i.e.
    acceptable to the end-user and integrated into the product). The selected
    stories are the ones that will be implemented during the sprint. During the
    sprint, the sprint is ‘executed’. Selected stories are developed, tested and
    accepted by the stakeholders. At the end of each sprint the sprint is
    ‘reviewed’ and we have a demonstration of the functionality developed during
    the sprint. Progress is then evaluated in terms of stories implemented, stories
    remaining and budget spent.

    Subsequent to the completion of a sprint’s review, the user-testing team is responsible
    for testing the product against acceptance tests, and reporting defects found.
    During subsequent sprints, reported defects are assessed and planned along with
    other user stories.

    The contribution of the partners to the development of the HCS vLab is as follows:

    • 15 days of effort to help Intersect
      integrate each tool or corpus contributed into the HCS vLab.

    • 30 days of effort involved in vLab
      requirements gathering and user testing.

    • 3 days of effort involved in project

    • 30 days of effort involved in user group
      meetings and post-project user support activities (Jan 2014 – Dec 2015).

    4. Deployment Stage

    At the end of the last sprint in the Development
    phase, we have a fully functioning, integrated and tested piece of software.
    Unlike traditional waterfall development, there is no ‘final testing’ or
    ‘acceptance testing’ phase – the software has been tested by its users throughout
    the project. During the Deployment phase the system is deployed to production
    in accordance with the hosting arrangements.

    B.16    Quality

    Quality Control and Acceptance testing are
    integrated in Intersect’s development approach, and are ongoing from the start
    of the project. This mitigates the risks of “big-bang” integration and
    acceptance testing. Part of the quality control is integrated with other
    processes, including test-driven development and writing acceptance tests
    before implementing a user story. Additional testing (e.g. non-functional
    requirements, risk-mitigation testing) is performed based on the quality
    management plan (see above).

    For each user story implemented during
    development, quality assurance is managed by interaction between the Product
    Owner (as a representative of the governance group), the developer responsible
    for that story, and the senior test-engineer on the project.

    During planning for the sprint, the Product
    Owner is responsible for defining the acceptance criteria for the story, with
    support from a test engineer.

    During the execution of the sprint, the
    developer responsible for the implementation of the story defines automated
    unit tests, to guard against regressions. Prior to the completion of the
    sprint, the test engineer is responsible for signoff of the user story against
    the acceptance criteria. Where possible, the acceptance tests are automated,
    using frameworks such as Cucumber.
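
    For illustration, an acceptance criterion for a user story such as “browse a corpus” might be expressed as a Gherkin scenario with executable step definitions (shown here in Python using the behave library as a stand-in for Cucumber); the scenario wording and the test-helper calls are illustrative assumptions only.

        # Illustrative sketch only: an automated acceptance test for one user story.
        # The Gherkin scenario would live in a .feature file, for example:
        #
        #   Scenario: Browse a corpus
        #     Given I am logged in as an authorised researcher
        #     When I open the "AusTalk" corpus
        #     Then I should see a list of its items
        #
        # Python step definitions (behave used here as a stand-in for Cucumber):
        from behave import given, when, then

        @given("I am logged in as an authorised researcher")
        def step_login(context):
            context.session = context.vlab_client.login("test-researcher")  # hypothetical helper

        @when('I open the "{corpus}" corpus')
        def step_open_corpus(context, corpus):
            context.items = context.session.browse(corpus)  # hypothetical call

        @then("I should see a list of its items")
        def step_see_items(context):
            assert len(context.items) > 0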

    During the review of the sprint, the Product
    Owner (representing the governance group) validates that the stories
    implemented during the sprint have the correct functionality. Subsequent to the
    demonstration, the end-user testers are responsible for end-user testing of the
    system as implemented to date, as described above.

    A deliverable is comprised of multiple user
    stories. Formally, the Product Owner, acting as a proxy for the governance
    group, is responsible for the acceptance of a deliverable. This responsibility
    comprises: ensuring that a set of user stories covering the deliverable are
    defined, elaborated, tested and reviewed.

    Commissioning Testing

    Wherever possible, commissioning testing is combined with
    Acceptance testing. That is, we deploy the system early and often, and
    acceptance tests are run against a production system. This will be the NeCTAR
    Research Cloud, as available, including the lead node at the University of
    Melbourne during development.

    B.17    Risk
    and Issue Management




    This project has an external dependency on the
    development of the NeCTAR research cloud. Delays in access to the cloud
    increase the risk of difficulty in deployment and commissioning (cf.
    Commissioning Testing, above).

    Mitigation: The project has the ability to use Intersect
    storage and HPC in the event of NeCTAR delays.



    This project has an external dependency on the
    development of the RDSI project. Delays in access to the storage on RDSI
    increase the risk of difficulty in deployment and commissioning.

    Mitigation: The project has the ability to use Intersect
    storage and HPC in the event of RDSI delays.



    This project depends on further development of
    open-source software. There is a risk that the governance of that software
    will not incorporate our changes, resulting in fractured support for that software.

    Mitigation: The project members will engage with the
    open-source software developers to ensure acceptance of our changes.



    This project depends on further development of
    open-source software. The functionality of that project has been assessed and
    is appropriate. There is a risk that the code base of the project is more
    difficult than anticipated to develop in support of our intended functionality.

    Mitigation: In most cases, there will be close collaboration
    between project members and the original developers of the software.



    This project proposes a distributed
    development effort. There is a risk that the communications overhead of this
    effort is greater than expected, impacting on the speed of development. We
    estimate the likelihood of this risk as low.

    Mitigation: In addition to regular video-conferences
    throughout the project, the proposal makes provisions for a face-to-face
    meeting of all project members in the Initial Phase and 4 visits by the
    Project Manager and Product Owner to the 4 main centres (Sydney, Melbourne,
    Canberra and Perth).



    Other possible risks include:

    • Ability
      to manage and arbitrate data access, including privacy concerns

    • Impact
      of a lack of standards, or of tools being integrated using non-standard formats

    • IP
      position for tools being integrated

    • Competing
      research project requirements, making it difficult to arrive at a consolidated
      position during the execution of the project

    Approach to Risk Management

    The HCS
    vLab project will maintain a risks register, including the impact, likelihood
    and mitigation strategy for each risk, as it is identified. The risks register
    will be a part of the regular reporting to both the governance body and NeCTAR.
    For technical risks, risk level is
    accounted for in prioritising user stories – more risky stories are,
    other things being equal, attempted before less risky stories. For staffing-related risks, use of Intersect
    as a development partner with significant capacity ameliorates the risk of
    departure of key staff. For external/dependency
    risks, responsibility for identification and management of the risk profile
    of the project rests with the Project Manager.


    B.18    Standardisation
    and Interoperability     

    There is a significant move towards
    interoperability of language resources and tools in the international arena as
    national and international groups collaborate more frequently on large scale
    projects. There are three primary areas of focus for standardisation that are
    relevant to this project: meta-data, annotation standards and tool interoperability.

    Meta-data standardisation has long been
    a focus of the language resources community with well established standards
    such as OLAC and IMDI used to describe resources in various
    meta-data repositories. More recent efforts have established the ISOcat
    data category registry that is used to register vocabularies used at various
    levels in language resources. A significant EU effort just getting underway is META-NET
    (Multilingual Europe Technology Alliance) which aims to establish a platform
    for resource sharing around Europe and has already made significant
    contributions relating to meta-data management. The current ANDS funded Australian National Corpus project is developing a
    hybrid meta-data model suitable for describing language corpora which can be
    exported to the ANDS Research Data Collection via the RIF-CS
    standard vocabulary. The HCS vLab will make ARDC-compliant use of RIF-CS at the
    collection level, through automatic export of selected data set descriptions to
    ARDC as well as by allowing import of data that others have ‘advertised’ on the ARDC.

    The standardisation of annotation
    formats and the semantics of various annotation schemes has been a focus of the
    ISO TC 37 (Terminology and Other Language and Content Resources) working
    groups. This group has established, for example, standards for Morphosyntactic
    annotation (MAF), a Lexical Markup Framework for use in dictionary-like
    resources (LMF) and an interchange format for annotation (LAF/GrAF). These
    formats and standards are now gradually being adopted by projects around the
    world and will provide for increased inter-operability between language
    resources. Project member Steve Cassidy is an active member of a number of ISO
    TC 37 working groups and was recently invited to help establish a working group
    to standardise query languages for language resources.

    Internationally: A number of projects in the EU and US aim to develop standard
    web-service architectures for defining and managing work-flows that process
    audio or textual resources using tools such as parsers, taggers, speech
    recognisers etc.  One important
    project that we are associated with is the US NSF funded project “The
    Language Application Grid: A Framework for Rapid Adaptation and Reuse”
    by Nancy Ide and James Pustejovsky which names the Australian National Corpus
    as one of a number of international collaborators.  The HCS vLab will work closely with these
    international partners to ensure that we are building compatible and
    interoperable toolsets.

    In Australia:

    • NICTA will contribute support
      for interoperability with UIMA, the Unstructured Information Management
      Architecture, an emerging standard for wrapping components for processing
      language, speech, video, and other unstructured data. Dr. Verspoor (NICTA) was
      a member of the OASIS standard committee for UIMA and is an editor of the
      published standard.

    • The Humanities Networked
      Infrastructure (HuNI)
      is a NeCTAR Virtual Lab established in the first round of funding.
      HuNI is primarily concerned with developing sustainable, relevant and enabling
      infrastructure for Australian humanities researchers and cultural custodians
      and it involves researchers and custodians working with cultural datasets. There
      is some overlap in the kinds of data being managed by each network and we
      believe that there is significant scope for HuNI to benefit from some of the
      workflows that will be developed within HCS vLab. We will work closely with
      HuNI to ensure that data can be exchanged where appropriate to maximise the
      value of this infrastructure to the audiences of both virtual laboratories.

    • La Trobe
      is making
      available the Humanities and Social Sciences Visualisation Laboratory (VisLab).
      VisLab allows for the remote use of scientific instruments and imaging of
      scientific data, creating a capability for interactive research collaboration,
      visualisation and imaging. For instance, the city of Melbourne is a research
      hub in the phonetics of endangered languages, yet the largest speech research
      facilities are at the University of Western Sydney and Macquarie University in
      Sydney. Using the VisLab, phonetics staff members and PhD students in Melbourne
      would have access to expensive equipment such as the 3D Carstens EMA machine at
      UWS (worth approximately $100K), without having to travel to Sydney for the
      fiddly and time-consuming trial-and-error stage of data acquisition. Such
      virtual access will save resources in the long term, since errors are less
    likely in the data collection stage if there has been sufficient time for such remote trial-and-error beforehand.




    B.19    Service Levels

    The service levels are set out in Table 4. Please refer to B.20 for
    further details of these services.

    Table 4: Service level that will be offered for each service


    Service                    Service level

    Application Hosting        24×7, 365 days per year, with 99.9% availability

    Business Services          For automated workflows: as for Application Hosting, above.
                               For manual workflows: initial response next business day.

    Virtual Service Desk       Initial response: next business day.

    Application Support        Initial response: next business day.

    B.20    Operations
    and User Support      

    With reference to the services in Table 4, the following
    operations and user support will be provided:

    Application Hosting

    Intersect will provide the application hosting support
    service for the HCS vLab application. Each service hosted by Intersect has an
    individual who is ultimately responsible for the provision of that service
    – the service owner. Services hosted by Intersect are monitored and
    backed up, and user support is provided during business hours. All services are
    automatically monitored for availability using a ‘shallow monitoring’ approach
    (“Is the service alive?”). Where appropriate, services are monitored using a
    ‘deep monitoring’ approach (“Is the service responding sensibly?”). Unexpected
    outages are:

    • published via a mailing list – the makeup of
      the mailing list being maintained by the service owner;

    • escalated to the systems administration team.

    Planned outages, including for upgrades of software and hardware, are requested by the
    systems administration team. These outages are publicised and negotiated with
    the user base of the service by the service owner.
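
    The distinction between shallow and deep monitoring described above can be illustrated with the following sketch; the status URL and the expected response fragment are hypothetical placeholders, and this is not Intersect's actual monitoring tooling.

```python
# Minimal sketch of 'shallow' ("Is the service alive?") versus 'deep'
# ("Is the service responding sensibly?") monitoring checks.
# The URL and expected body fragment are hypothetical placeholders.
import urllib.error
import urllib.request

SERVICE_URL = "https://hcsvlab.example.org/status"   # hypothetical endpoint

def shallow_check(url: str) -> bool:
    """Shallow monitoring: any successful HTTP response counts as 'alive'."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        return False

def deep_check(url: str, expected: str = "HCS vLab") -> bool:
    """Deep monitoring: the response body must also look sensible."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
            return expected in body
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print("alive:", shallow_check(SERVICE_URL))
    print("responding sensibly:", deep_check(SERVICE_URL))
```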

    Business Services

    The following business services have been identified:

    • User Registration, Research Data Annotation
      Service, Tool Execution Service, Federated Search, Corpora Management and
      Commissioning Service, Tool Management and Commissioning Service, Workflow
      Capture Service, ANDS Publication Service.

    • UWS Marcs will be responsible for servicing
      these requests. The project is aiming to implement a self-service model.
      However, it is likely that not all services will be fully automated; some
      services may require human co-ordination. The requests will originate from
      application workflows. Service levels are as specified in B.19.

    Virtual Service Desk

    UWS Marcs
    and Intersect will jointly maintain a virtual service desk for users. Intersect
    will provide a single point of access for end-users of the service to lodge and
    track issues and other service requests. UWS Marcs will provide tier-1 (first
    response and triage) and tier-2 (business workflow and problem resolution)
    support to users. UWS and Intersect will provide tier-3 (software defect
    resolution) support for the vLab framework proper, whereas the tools themselves
    will be supported by contributing project partners. Jira will be used to track
    all issues and service requests. UWS is committed to supporting the service desk
    until 2014.

    Application Support

    The application support service (e.g., account creation and
    access control, troubleshooting) will be carried out by UWS Marcs tier-1
    support. In addition to the Virtual Service Desk, UWS Marcs and Intersect will
    provide learning and development (L&D) and outreach programs. Support for
    the project during its operation will include:

    • Development
      and delivery of training modules for the use of the outcomes of the project
      through UWS’ and Intersect’s learning and development activities,

    • Publicity and on-going assistance with the outcomes of the project,
      through Intersect’s outreach activities. This work is distributed amongst the
      team of 9 research analysts.

    B.21    Sustainability           

    The HCS
    vLab will be operationalised throughout the course of the project and is
    expected to attract a broad and active user group both within the research
    communities and among external users. After the project is completed, UWS Marcs,
    with its partners, will take responsibility for supporting and hosting the HCS
    vLab. As researcher requirements are expected to be diverse with new corpora
    and new tools becoming available, there will be strong drivers in place for
    further development and improvement to occur. UWS Marcs will:

    • Explore opportunities to
      expand the HCS vLab service desk.

    • Negotiate with its partners
      to secure additional funding for software development ($150,000 has already
      been allocated).

    • Engage and influence UWS to
      provide further operational and application support for the HCS vLab.

    • Explore opportunities to
      develop a transactional cost model for external parties to gain access to the HCS vLab.

    • Explore opportunities to
      develop a data mining and analysis service for external parties.

    • Apply for further
      infrastructure grants to expand the HCS vLab further.

    • Facilitate the development
      of training courses for adding tools and corpora to the HCS vLab –
      on-line and face to face.

    • Commercialise the system
      and make it available to new and emerging research centres and external
      commercial enterprises for a fee. UWS would commit to reinvesting funds in
      the system and establishing a product user group.

    • Hold conferences and forums
      to further the concept of the HCS vLab; resulting funds will be reinvested.

    There are already very strong links between the HCS vLab partners and the research
    community (see B.2) and it is expected that the infrastructure built will
    encourage the integration of future databases and analysis tools. Several
    partners are already planning to add corpora and tools to the HCS vLab (see B.9
    and B.10). In addition, A/Prof. Drew Khlentzos, from the Language and Cognition
    Research Centre at UNE, has proposed the integration of a new logic database
    comprising core logical principles governing the main logical operators that
    are expressed in most (if not all) of the world’s languages, which should be
    available for incorporation into the HCS vLab within the next 2 years.

    B.22    IP,
    Licensing and Access     

    Intersect currently makes software available under an
    open source licence. Intersect’s IP in NeCTAR projects will be made available
    to Australian publicly funded researchers under the same conditions.

    All the corpora and tools in the HCS vLab will be
    available to researchers and users under appropriate licensing, according to
    guidelines published by the Australian National Data Service.

    B.23    Communications
    and Engagement

    There are three periods during which customer satisfaction is solicited by Intersect.

    • During
      the concept and elaboration stages (see B.15) the governance and stakeholder
      groups represent the customer. They are engaged frequently and in depth through
      this phase by the developer, Intersect. Project plans and quality management
      plans are reviewed by the group, as are the initial list of user stories that
      define the initial direction of the project. There is no separate process by
      which satisfaction is measured – it emerges from the collaborative nature
      of the project.

    • During
      the development and deployment stages (see B.15), again the governance and
      stakeholder groups represent the customer. The stakeholder group is engaged via
      sprint-end demonstrations and end-user testing, and can affect the course of
      the project during sprint planning. Rather than attempt to measure
      satisfaction, Intersect will engage continuously, and allow the customer
      regular opportunities to effect change in the project’s direction. The Project
      Manager reports to the Steering Committee monthly. The governance group is
      invited to provide input at that point, including their level of satisfaction.

    • After
      deployment of every project carried out by Intersect, the Development Team,
      Product Owner and Project Manager conduct a ‘lessons learned’ forum.
      Representatives from the Steering Committee can attend and provide input. At
      this forum, the outputs of the project and the process by which they were
      derived are critically analysed.

    After 6
    months, the governance group are consulted for their satisfaction with the
    process and outputs of the project. Intersect uses a standard form to solicit
    the group’s input. Formal channels, as outlined above, go some way to measuring
    customer satisfaction. Intersect also makes active use of a variety of informal
    channels to solicit feedback, including consulting DVCRs or PVCRs and CIOs of
    universities making use of the service, attending university eResearch
    committee meetings, and through the outreach conducted by eResearch analysts.
    These informal channels provide an important complement to the formal ones.

    Effective engagement under Intersect’s development model requires that the stakeholder
    group is broadly representative of the eventual end-user community. Further,
    the model requires a significant investment of time from members of the
    stakeholder group over the period of the project.

    B.24    Constraints
    and Dependencies    

    External Party: Seeding The Commons Project SC20
    Capability Required: Data Catalogue
    Date first required:
    Milestones or Deliverables dependent on that capability:

    External Party: and AusNC Inc.
    Capability Required: and Support for integrating the AusNC corpora and tool sets