Astute users may have noticed a change to the Austalk collection in Alveo in the last couple of days. We are re-ingesting Austalk into Alveo to correct some errors with the previous version of the data. This means that we removed the old version and then re-loaded all of the data into Alveo. As I write this there are 400,000 of the 800,000 items available; the remainder should load over the next day.
This new ingest will allow us to attach the annotation files to those items in Austalk that have been either transcribed or annotated phonetically. Once these are in place we’ll provide some pointers on finding and working with annotated data.
One of the errors we found with the data was that we had included some speakers who did not belong in the core Austalk collection. In some cases these were test speakers who should never have been published, but most of them came from a later accented-English collection by Michael Wagner, which used the Austalk protocol to collect data from a different group of target speakers.
We will make the accented Austalk data available as soon as we have it all in one place. We also have the Austalk Emotional speech collection from Julien Epps at UNSW in preparation. Finally, the video data associated with the main Austalk collection will be made available as a separate collection on Alveo.
I’m pleased to report that the Alveo server is now fully restored and all services should be working again as normal. AAF login is working again and password reset emails are now being delivered.
There is some work still in progress. In particular the Galaxy server will be updated soon with some more tools for manipulating speech data. We have been building tools to support workflows involving forced-alignment with MAUS and formant tracking with the Emu wrassp toolkit. These are now mostly working and we will deploy them as soon as we can. The use of Galaxy for speech and language analysis is a new development and we are still working out the best way to build tools and chain them together. When we have some tools available we’ll invite you to experiment and provide feedback so that we can hopefully build something that is generally useful to the community.
An update on the new server deployment. The Alveo repository is now re-installed on new infrastructure at NCI Canberra. All collections are re-ingested and should be available as before, but there are a couple of unresolved issues that we are still working on:
- AAF logins are not yet working, so you will need to log in with a username and password if you have one
- We are not able to send mail from the server, so you will not be able to get password reminders or create new accounts
Unfortunately, in combination these problems might block many users who previously used AAF login to access Alveo. We are working on both issues and hope to have them resolved next week.
The ingest of the full Austalk collection was interrupted at some point and so not all of the collection is present. We will be re-ingesting this collection this weekend (19-20 Nov) so hopefully it will be fully available next week.
One new collection is now available: MAVA, a collection of audio-visual read speech from a single speaker, collected by Vincent Aubanel of Western Sydney University.
I will post further updates as things change.
As of this morning (1st November) the Alveo server is offline. We are currently moving the server from its previous home at Intersect in Sydney to the facilities of NCI in Canberra. We had hoped to have a seamless transition between the two services but unfortunately the new server is not quite ready.
We will bring Alveo back online as soon as possible. All user accounts and collections should be maintained.
One major addition will be that for the first time we will have the full Austalk collection on Alveo. We’ve been working on finalising this collection for some time and this is the first opportunity we’ve had to get the entire collection ingested. When the server returns you should see over 850,000 items in the Austalk collection.
When we set out to build Alveo the aim was always that it should be a repository for new collections contributed by researchers; however, the initial impetus was to get a number of older collections ingested and build the platform capabilities. New collections were added to Alveo via a back-end process that only the developers could run.
We have since added hooks to the API that allow new collections, items and documents to be added to the Alveo repository. This extended API has now been deployed on the main system, and we have extended the pyalveo library so that scripts can be written to add new data. I recently used this facility to add the first contributed collection to Alveo: a collection of children’s speech data. This blog post describes the script that I wrote to do this, as a short tutorial on the process.
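The heart of such a contribution script is assembling metadata records for the collection and its items before handing them to the API. As a minimal, self-contained sketch of that step (the property names below follow the Dublin Core and AusNC vocabularies that Alveo metadata generally uses, but the exact fields and values here are illustrative assumptions, not the precise Alveo schema):

```python
import json

def make_collection_meta(name, creator, abstract):
    """Assemble a minimal JSON-LD style metadata record for a new
    collection. Property names are illustrative, not the exact schema."""
    return {
        "@type": "dcmitype:Collection",
        "dcterms:title": name,
        "dcterms:creator": creator,
        "dcterms:abstract": abstract,
    }

def make_item_meta(identifier, documents):
    """Metadata for one item plus its attached documents
    (audio files, transcripts, etc.)."""
    return {
        "dcterms:identifier": identifier,
        "ausnc:document": [
            {"dcterms:identifier": d, "dcterms:type": "Audio"}
            for d in documents
        ],
    }

# Hypothetical example records for a contributed children's speech collection
collection = make_collection_meta(
    "childspeech", "A Researcher", "Children's speech recordings")
item = make_item_meta("child001-story", ["child001-story.wav"])
print(json.dumps(item, indent=2))
```

A real script would loop over the source files, build one such item record per recording, and post each record to the repository via the extended API.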
Pyalveo is the Python module that interfaces to the Alveo web API, allowing automation of any task relating to Alveo. I have just made a new version (0.5) available on PyPI, the Python package repository (you can also find it on Github).
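The basic pattern behind such a client is simple: it builds URLs against the API and sends the user's API key with every request. Here is a minimal self-contained sketch of that pattern (the header name, base URL and URL layout are assumptions for illustration; consult the pyalveo documentation for the actual interface):

```python
class MiniClient:
    """A toy illustration of an Alveo-style API client."""

    def __init__(self, api_url, api_key):
        self.api_url = api_url.rstrip("/")  # normalise trailing slash
        self.api_key = api_key

    def _headers(self):
        # Requests authenticate with the user's per-account API key
        return {"X-API-KEY": self.api_key, "Accept": "application/json"}

    def item_url(self, collection, item_id):
        # Items live under the catalog, grouped by collection
        return f"{self.api_url}/catalog/{collection}/{item_id}"

client = MiniClient("https://app.alveo.edu.au", "my-secret-key")
print(client.item_url("austalk", "1_1274_2_7_001"))
# prints https://app.alveo.edu.au/catalog/austalk/1_1274_2_7_001
```

A full client like pyalveo wraps this pattern with methods for fetching item lists, item metadata and document contents over HTTP.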
The workshop, hosted by the Big Data Processing and Mining Group at UWA, brought together 13 researchers, including 6 members of the Alveo Steering Committee, for 2 days to discuss further improvements to the platform and how they will use Alveo in their own research. Several researchers and post-graduate students from UWA and Curtin University participated and presented their own projects.
I was invited to give a presentation on Alveo and Austalk at the First Workshop on Sociophonetic Variability in the English Varieties of Australia, held at Griffith University in Brisbane in June. The workshop, organised by Gerry Docherty and Janet Fletcher, was supported by the Centre of Excellence for the Dynamics of Language and was attended by phoneticians from around the country, with a keynote given by Prof. Jonathan Harrington, who flew in from Munich.
Austalk is a large collection of spoken Australian English collected over the last few years at sites around Australia. When the collection is complete it will have close to 1000 speakers, each with recordings ranging from isolated words to interviews and map tasks. Alveo contains most of the data and will hold the complete corpus once collection and data processing are finished.
Students at Monash University used Alveo as part of an introductory course in computational linguistics this year. The students completed an assignment that used basic techniques of corpus linguistics to compare a single phenomenon across varieties of English.