Recent work on Alveo now allows us to allocate a DOI for any collection held on the system. A DOI (Digital Object Identifier) is intended to be a persistent digital identifier for an electronic resource; one that can be cited in your publications.
To be able to allocate a DOI we needed to have public pages corresponding to each collection in Alveo. In the past, collection pages have only been visible to logged-in users. In addition to making these pages public, we have provided the option of adding a rich text description of the collection and adding attachments to the collection page, such as images or PDF files. As a result, collection pages can now act as the main ‘landing’ page for a collection and provide full documentation for future users. Since these pages are visible without login, they will be indexed by search engines and should help users find your collections.
Once public pages were available, we established a procedure through Macquarie University Library to allocate a DOI for a collection. The ultimate provider of the DOI is ANDS through their Cite My Data service. Macquarie acts as an overseer to help ensure that the DOI is long-lasting. Should the hosting of Alveo move to another institution in future, the management of the DOIs can also be transferred.
It is appropriate to issue a DOI for a collection on Alveo if the following conditions are met:
- the collection is complete and you do not envisage it changing in future
- Alveo is the main and definitive source for the data
The process to issue a DOI is manual at the moment: a collection owner can contact me (Steve Cassidy) to request a DOI. I will then liaise with them to confirm that the conditions above are met and that the appropriate metadata is present in the collection before issuing a DOI.
Our first DOI has been created for Vincent Aubanel's MAVA Collection: 10.4227/139/59a4c21a896a3. Vincent has now cited it in this paper: Contribution of visual rhythmic information to speech perception in noise.
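For anyone wanting to link to a collection from a script or a reference manager, a DOI can be turned into a clickable address by prefixing it with the public doi.org resolver. The snippet below is a minimal sketch of that idea; the function name `doi_to_url` is made up for illustration and is not part of Alveo or pyalveo.

```python
from urllib.parse import quote

# Public DOI resolver; a DOI appended to this address redirects to the landing page.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a bare DOI (hypothetical helper, for illustration)."""
    doi = doi.strip()
    # Every DOI starts with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep '/' as-is and percent-encode any other character that is unsafe in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.4227/139/59a4c21a896a3"))
```

Because the DOI rather than the Alveo page address is what gets cited, the link should keep working even if the hosting of the collection moves.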
If you are the owner of a collection on Alveo and would like to take advantage of this facility please get in touch.
If your collection is not already on Alveo you could also get in touch, but watch this space for news about easier ways to get your data uploaded to Alveo.
Listen to this Macquarie Dictionary podcast with interesting remarks on the Australian English accents and pronunciation from Felicity Cox, and some audio snippets from data which is available through Alveo.
Astute users may have noticed a change to the AusTalk collection in Alveo in the last couple of days. We are re-ingesting AusTalk into Alveo to correct some errors with the previous version of the data. This means that we removed the old version and then re-loaded all of the data into Alveo. As I write this there are 400,000 of the 800,000 items available; the remainder should load over the next day.
This new ingest will allow us to attach the annotation files to those items in AusTalk that have either been transcribed or annotated phonetically. Once these are in place we’ll provide some pointers to finding and working with annotated data.
One of the errors we found with the data was that we had included some speakers that did not belong in the core AusTalk collection. In some cases these were test speakers who should not have been published, but most of them were from a later accented English collection by Michael Wagner which used the AusTalk protocol to collect data from a different group of target speakers.
We will make the accented AusTalk data available as soon as we have it all in one place. We also have the AusTalk Emotional speech collection from Julien Epps at UNSW in preparation. Finally, the video data associated with the main AusTalk collection will be made available as a separate collection on Alveo.
I’m pleased to report that the Alveo server is now fully restored and all services should be working again as normal. AAF login is working again and password reset emails are now being delivered.
There is some work still in progress. In particular the Galaxy server will be updated soon with some more tools for manipulating speech data. We have been building tools to support workflows involving forced-alignment with MAUS and formant tracking with the Emu wrassp toolkit. These are now mostly working and we will deploy them as soon as we can. The use of Galaxy for speech and language analysis is a new development and we are still working out the best way to build tools and chain them together. When we have some tools available we’ll invite you to experiment and provide feedback so that we can hopefully build something that is generally useful to the community.
An update on the new server deployment. The Alveo repository is now re-installed on new infrastructure at NCI Canberra. All collections are re-ingested and should be available as before, but there are a couple of unresolved issues that we are still working on.
- AAF logins are not yet working so you will need to log in with a username/password if you have one
- we’re not able to send mail from the server so you will not be able to get password reminders or create new accounts
Unfortunately, in combination these problems might block many users who previously used AAF login to access Alveo. We are working on both issues and hope to have them resolved next week.
The ingest of the full Austalk collection was interrupted at some point and so not all of the collection is present. We will be re-ingesting this collection this weekend (19-20 Nov) so hopefully it will be fully available next week.
One new collection is now available: MAVA is a collection of audio-visual read speech from a single speaker, collected by Vincent Aubanel from Western Sydney University.
I will post further updates as things change.
As of this morning (1st November) the Alveo server is offline. We are currently moving the server from its previous home at Intersect in Sydney to the facilities of NCI in Canberra. We had hoped to have a seamless transition between the two services but unfortunately the new server is not quite ready.
We will bring Alveo back online as soon as possible. All user accounts and collections should be maintained.
One major addition will be that for the first time we will have the full Austalk collection on Alveo. We’ve been working on finalising this collection for some time and this is the first opportunity we’ve had to get the entire collection ingested. When the server returns you should see over 850,000 items in the Austalk collection.
When we set out to build Alveo the aim was always that it should be a repository for new collections contributed by researchers; however, the initial impetus was to get a number of older collections ingested and build the platform capabilities. New collections were added to Alveo via a back-end process that only the developers could run.
We have since worked on adding the hooks into the API to allow new collections, items and documents to be added to the Alveo repository. This extended API has now been deployed on the main system and we have extended the pyalveo library to allow scripts to be written that add new data. I recently used this facility to add the first contributed collection to Alveo: a collection of children’s speech data. This blog post describes the script that I wrote to do this, as a short tutorial on the process.
Pyalveo is the Python module that interfaces to the Alveo web API to allow automation of any task relating to Alveo. I have just made a new version (0.5) available on PyPI, the Python package repository (you can also find it on GitHub).
Thirteen researchers, including six members of the Alveo Steering Committee, met for two days at a workshop hosted by the Big Data Processing and Mining Group at UWA to discuss further improvements to the platform and how they will use Alveo in their own research. Several researchers and post-graduate students from UWA and Curtin University participated in the workshop and presented their own projects.
I was invited to give a presentation on Alveo and Austalk at the First Workshop on Sociophonetic Variability in the English Varieties of Australia, held at Griffith University in Brisbane in June. The workshop, organised by Gerry Docherty and Janet Fletcher, was supported by the Centre of Excellence for the Dynamics of Language and was attended by phoneticians from around the country, with a keynote given by Prof. Jonathan Harrington, who flew in from Munich.