SOUTH AFRICA

Data-intensive research capacity boosted ahead of SKA
A consortium of institutions in South Africa has been formed to establish a Western Cape Data Intensive Research Facility as part of the country’s National Integrated Cyberinfrastructure System. The aim is to dramatically increase data-intensive research capacity ahead of the global astronomy research initiative, the Square Kilometre Array. This follows Department of Science and Technology approval of the facility last month.
The consortium is led by the University of Cape Town and includes the University of the Western Cape, Cape Peninsula University of Technology, Stellenbosch University, the Square Kilometre Array – SKA – project, and the new Sol Plaatje University in Northern Cape province, close to the location of the MeerKAT telescope and the site for SKA.
It will establish and operate a data-centric high-performance computing facility for data-intensive research focused primarily on the priority research challenges of astronomy – with a particular focus on the SKA project – and bioinformatics and related clinical research.
“With the completion of MeerKAT in 2017, we will have in place the first elements of the SKA project. The SKA is a global mega science project that drives one of the world’s largest data challenges in the coming decades,” says Professor Russ Taylor, SKA research chair at the universities of Cape Town and the Western Cape, and director of the Inter-University Institute for Data Intensive Astronomy.
The SKA project will build the world’s largest radio telescope, 50 times more powerful and 10,000 times faster than any other. SKA will be constructed in Western Australia and the Karoo in South Africa; eight other African countries will host antennae, and there will be SKA activities in around 20 countries.
Rising to the big data challenge
“Having won the SKA bid to co-host with Australia, South African research-intensive universities find themselves at a point in time where they must rise to this big data challenge if South African researchers are to play a leading role in the SKA science enterprise.”
Otherwise, Taylor adds, they will be shipping data offshore to be processed and analysed by scientists in SKA partner countries.
“Essentially countries that solve the data challenges of the SKA will be global leaders in data science, and will reap the societal benefits of resulting big data innovation and expertise. The Data Intensive Research Initiative of South Africa [DIRISA] will be a focal point of big data innovation for the SKA in South Africa,” Taylor explains.
In biomedical and other biological research, high-throughput technologies and genetic mapping technologies are being deployed, driving a growing demand in bioinformatics for expertise and facilities for big data management, storage and analysis.
Researchers in South Africa are beginning to use these techniques to study genetic clues to disease and treatments, and large datasets must be sequenced in projects across the country, including for the Southern African Human Genome Project.
This kind of analysis is just one small example among a myriad of other applications of bioinformatics tools to big biological data analysis. “We need to build South African capacity to deal with these data sets.”
A tiered system
As for when the consortium will become operational, Taylor states that while the project has been approved, the funding has not yet been released. However, several elements of the DIRISA facility are building upon initiatives that are already in the early stages of development.
The infrastructure to host the facility has been established at the University of Cape Town. Once the funding is in place, the facility will be up and running within about four months. It will serve as a regional node of the greater cyberinfrastructure system, which will include national (Tier 1), regional (Tier 2) and institutional (Tier 3) infrastructure.
The tiered infrastructure model for the envisaged South African DIRISA network includes a Tier 1 national facility – the current Centre for High Performance Computing is an example. Tier 2 facilities are either ‘regional’, serving a region of the country, or set up to serve the national community but specifically in special strategic science areas.
“Our Tier 2 is the latter. We are setting up a facility that will serve a national community of researchers in astronomy and bioinformatics research. It will be hosted in the Western Cape and operated by the consortium. Tier 3 facilities are computing facilities that serve a single institution, for instance one university,” Taylor explains.
There is no international tier in this system. However, Tier 2 facilities will engage in projects that federate their centres with others around the world that are being set up to serve national communities on the SKA project.
In particular, there will be an agreement with a European team led by the Netherlands to cooperate on federating data-intensive research facilities into a global operation. The regional science and data centres, Taylor says, will explore technologies and systems to prototype a global network of data-intensive research facilities for use with the SKA. This will be one of the key projects of the new facility.
‘Democratising’ big data
Scientists are working on developing innovations in cloud technologies as a means to provide wide access to software and computing resources and to bring together distributed resources. This is meant to ‘democratise’ big data by providing the tools to allow individual researchers to work with very large data sets.
Portals and platforms to be developed will allow researchers to access the facility and the big data tools through a web-based interface. Basically, anyone with access to the internet will be able to use the facility and access the big data.
“We will undertake research and partnerships with global organisations that are developing policies and processes for sharing and access to research data. Big data offers unique challenges,” Taylor states.
There are plans to collaborate on training in Africa. Also, since the new consortium is largely made up of universities, the data-intensive research facility will be closely linked with training in data science at the partner universities.