Message from NSTF Executive Director

Data – big and small


Big data is already here

The reason we are living in the era of the 4th Industrial Revolution (4IR) is that computing power has become mind-bogglingly massive. Never before has it been possible to store so much data or to process it so fast. Storage and particularly processing capacity make it possible to develop 4IR technologies like machine learning and artificial intelligence, the Internet of Things (IoT), driverless cars, robotics, large-scale 3D printing, etc.

There are many questions that flow from the increasing applications of these technologies. What should happen to realise the full potential of big data sets? How can it be ensured that data is safely and efficiently stored and transferred? What about the management, curation and governance of big data sets? What does big data mean for researchers? Do we need laws to regulate the use of these technologies? What about ethical behaviour?

Google and other service providers already offer services to handle big data owned by customers. It is difficult for individual companies to store and analyse data when the datasets become too large. These tools of the 4IR are already available and not too difficult to use (assuming the budget is also available). I can relate the example of Google’s service called BigQuery, because the NSTF Brilliants students were treated to a workshop by Google in 2017, and I was allowed to accompany them. (Apologies for this bit of free marketing.)

The data is (safely) stored by the service provider under the customer’s name. The customer can submit questions (queries) online about their big dataset, and the service provider (Google, in this example) returns the answers. The customer controls access to the data and to the project they are working on, and can give others access as needed.

Google Cloud IoT also provides a platform for intelligent IoT services. It is “a complete set of tools to connect, process, store, and analyze data”. (There are probably other companies offering this service). One can run IoT solutions with machine learning capabilities. Thus, Artificial Intelligence (AI) can be extended to the customer’s devices/assets linked by IoT. Assets of your company can be tracked wherever they are on the globe.

The service provider therefore provides the 4IR technologies and removes the customers’ burdens of owning expensive equipment, having the relevant expertise in-house and analysing the data. If these solutions prove to be safe, they are empowering for non-experts in data management and analysis, who can then do their own work more effectively.

Big data should change things for the better

Addressing delegates at the annual conference of the Centre for High Performance Computing (CHPC) on 3 December 2018, the Director-General of Science and Technology, Dr Phil Mjwara, said that the core of the 4IR was the emergence of cyber-physical systems, based upon our ability to collect massive amounts of data, manipulate and analyse them efficiently, and transfer them fast and securely. The following is quoted from the media statement issued by Government:

As digitisation disrupts society ever more profoundly, concern is growing about how it is affecting issues such as jobs, wages, inequality, health, resource efficiency and security.

“While these are extremely complex challenges, analysis suggests that digital transformation has the potential to make a positive contribution, if we can get certain things right,” said the DG.

The Director-General said South Africa needed to create a workforce for the machine age, build trust in the digital economy and encourage a paradigm shift in industry.

Through 4IR technologies, it should be possible to fast track the finding and implementing of solutions to the major problems South Africa faces – healthcare; electricity supply; water purification, distribution, testing, and allocation; parts of education and educational resource provision; training of all kinds; keeping track of government finances, tightening up systems and processes to curtail corruption, etc.

All of these potential solutions depend on two things – the political will to do what is needed, and data – lots of data. South Africa does not always collect data systematically. There are gaps in our data sets, and data sets can be skewed in terms of population groups or geographic areas. It is time for South Africa to take data seriously, and to put systems and processes in place all over government to collect data consistently and store it safely. A case in point is keeping track of police dockets – perhaps it would be harder to ‘lose’ dockets if the system was electronic from start to finish, and there were checks and balances built in?

Other African countries face the same challenges and potential solutions. The Conversation carried an article on ‘Why fixing Africa’s data gaps will lead to better health policies’ on 26 February 2019.

Achieving the Sustainable Development Goals (SDGs) and the goals of the African Union’s Agenda 2063 also depends on how well we can monitor attainment of the indicators. The better the data sets, the more accurate this monitoring will be, and the better countries can manage their progress.

A report of The Center for Strategic and International Studies (CSIS) by Erol Yayboke, Deputy Director and Senior Fellow, Project on Prosperity and Development, Project on US Leadership in Development, says in its Executive Summary:

Functioning societies collect accurate data and utilize the evidence to inform policy. The use of evidence derived from data in policymaking requires the capability to collect and analyze accurate data, clear administrative channels through which timely evidence is made available to decisionmakers, and the political will to rely on—and ideally share—the evidence. The collection of accurate and timely data, especially in the developing world, is often logistically difficult, not politically expedient, and/or expensive.

This report defines the data revolution as an unprecedented increase in the volume and types of data—and the subsequent demand for them—thanks to the ongoing yet uneven proliferation of new technologies. This revolution is allowing governments, companies, researchers, and citizens to monitor progress and drive action, often with real-time, dynamic, disaggregated data. Much work will be needed to make sure the data revolution reaches developing countries facing difficult challenges (i.e., before the data revolution fully becomes the data revolution for sustainable development).

Many outside the developing world are considering the endless possibilities presented by “big data.” For many in the developing world—especially those in statistical agencies and other entities responsible for data collection, dissemination, and analysis—big data is not even on the radar. These governments face enough challenges to utilization of “small data” and evidence more broadly in policymaking.

(Michelle Mbuthia, a Communications Officer at APHRC and Caroline Kabaria, a Postdoctoral Researcher at APHRC contributed to this article).

What does big data mean for scientific research?

Big data and the means to mine and analyse it have been useful for scientists for a long time. Large bioinformatics datasets make it possible to record and safely store data on plants and animals, many of which are expected to become extinct with climate change. The increasing processing and storage power of computers has made it faster to describe the genomes of humans and other species, and ever more comprehensive and meticulous records can be kept.

The potential for radio astronomy to make discoveries in space on a vast scale is made possible by ever-increasing computing power. South Africa has been working on solutions for the storage and analysis of radio-astronomical data gathered by the Square Kilometre Array (SKA). Only the first phase of the SKA, namely the MeerKAT in the Northern Cape, has been completed, and already the collected data is of a baffling magnitude.

At the annual conference of the CHPC, Dr Phil Mjwara said that South Africa has made significant investment in cyberinfrastructure over the past 10 years. The following is quoted from the media statement issued by Government:

The establishment of the National Integrated Cyberinfrastructure System (NICIS), managed by the CSIR, is central to this investment. Its pillars are the Data Intensive Research Initiative of South Africa, the South African National Research Network and the CHPC.

Through NICIS, South Africa is growing a robust, competitive and sustainable national platform for cyberinfrastructure provision, research and innovation, and human capital development. Its main purpose is to provide high speed connectivity, data storage and management services for researchers to be able to perform cutting-edge research and remain competitive.

Dr Mjwara said South Africa was at the forefront of the development of the first real-life implementations of quantum cryptography, and was now spearheading the field of quantum machine learning.

South Africa, therefore, is not doing badly in engaging with 4IR technology and research. It is remarkable given our country’s limited resources and challenges.

A recent editorial in Nature (27 February 2019) demonstrated how sharing experimental data, and mining it to study particular characteristics, can lead to remarkable discoveries. In this case, recent research papers reported on discoveries in materials science. Algorithms were developed to scan through databases of the crystal structures of non-magnetic materials. The scans showed that about 25% of the materials could be considered ‘topological’ – meaning they have “unusual states at their surfaces or edges that are caused by the geometry of their electronic structures”. Such materials could open up opportunities for materials engineering, such as energy-efficient transistors and circuits. These findings are theoretical at this stage, as it has yet to be determined whether all the materials can be synthesised, and whether they will indeed have the predicted characteristics. However, this remains a significant discovery, made possible by the mining of big data.

Topological catalogues are still in the early stages of development, so this community would do well not to miss the opportunity to push for widespread and standardised sharing of experimental data.

The NSTF-South32 Awards include a category for Data for Research. Its full title is: NSTF Award for an outstanding contribution to science, engineering and technology (SET) to an individual or a team by advancing the availability, management and use of data for research.

The NSTF, under the guidance of a team of experts through the Network of Data and Information Curation Communities (NeDICC), has been making this award for three years to date. It rewards an individual or a team (for example researchers, scientists, data scientists or data stewards) for the generation, preservation and sharing of a valuable scientific resource in the form of a data set or a data collection process for a data set. The resource should be of national interest or for the public good, and openly available to be re-used and/or re-packaged in products that are of public good and interest, or that could be integrated into products that contribute to the development of South Africa.

Big cybersecurity

At the NSTF’s discussion forum on the 4IR in September 2018, the director of the Cyber Security Institute, Prof Elmarie Biermann, reminded us of current cyberthreats and the need to ensure that our personal technology devices are safe from hacking and identity theft. I had invited her because we had addressed issues of cybersecurity and safety at previous discussion forums, and it wasn’t clear to me how we would cope with the 4IR if we already cannot be safeguarded against cyber attacks.

Prof Biermann did not disappoint. She said that on a national level such threats could amount to cyber-warfare, and that the military should be fighting and preventing threats that have far-reaching consequences.

She said that the world is moving towards the IoT in which computing devices will be embedded in everyday objects, enabling them to send and receive data. Shodan is the world’s first search engine for Internet-connected devices. Shodan can be used to discover which devices are connected to the Internet (e.g. webcams and printers), where they are located and who is using them. Devices that are connected to the Internet have an IP address that could provide access to hackers and criminals. The default settings of these devices should be checked, and logins and passwords should be regularly changed.

Criminals operate as businesses, employing researchers, software developers and others who may not even realise that they are part of the criminal system.

South Africa has a shortage of skills to deal with cyber-security. The universities have not caught up with respect to the importance of security, and the country has a great shortage of skilled people in that environment. Security is a central issue with widespread ramifications. South Africa has not yet experienced an attack on critical infrastructure such as the water or electricity supply system, so there is a lack of appreciation of the possible consequences. The implications must be considered more broadly, and better understanding of the cyber-security environment needs to be developed.

Big ethical issues

Nesta is the United Kingdom’s innovation foundation and runs a wide range of activities in investment, practical innovation and research. A blog post on Nesta’s website, written on 21 February 2019 by Geoff Mulgan and Vincent Straub and titled ‘The new ecosystem of trust – How data trusts, collaboratives and coops can help govern data for the maximum public benefit’, raises important issues regarding population data:

The world is struggling to govern data. The challenge is to reduce abuses of all kinds, enhance accountability and improve ethical standards, while also ensuring that the maximum public and private value can also be derived from data.

This paper argues that new institutions—an ecosystem of trust—are needed to ensure that uses of data are trusted and trustworthy. It advocates the creation of different kinds of data trust to fill this gap.

Then there are the issues of fairness and representivity in datasets:

Prof Tshilidzi Marwala, Vice-Chancellor of the University of Johannesburg (UJ), is well-known for his inspiring talks, his work on machine learning, economic modelling using AI, methods to deal with missing data, AI and rational decision making (among other subjects), as well as positioning UJ as the university to gain skills for the 4IR. At the opening in December 2018 of the Science Forum South Africa and at the Department of Science and Technology’s Science, Technology and Innovation Summit in November, he called on Africans to create their own databases “so we are not excluded in the revolution of science”. He said that machines respond to the data fed into them.

In his recent article published in the Sunday Independent (21 January 2019), he explores the impact of machine-learning algorithms. He points out that data goes where the money is. Data on travellers, for example, are collected mainly from the USA, Europe and China, and only to a lesser extent from Africa. He proposes guidelines for ethical principles that should govern technology:

“…from an economic perspective these machine-learning systems have to be trained on economic unreality of economic equality for them not to discriminate. If they are trained on economic reality of economic inequality then discrimination is inevitable. How do we untangle this dilemma between reality and unreality, which, respectively, leads to discrimination and fairness?

Firstly, we need to understand that technology follows the characters of its makers. If its makers create technology without regard to human safety, then it can easily become a danger to society. When the Nazis created technology with the intention to murder people because of their race and religion, the result was genocide. When Dr Wouter Basson created technology with the sole purpose of murdering people, the result was death of innocent people. Therefore, it is important that we ensure primarily that technology is regulated to protect people. The first principle that we should adopt as far as technology is concerned is that it should not kill or harm people.

The second principle we should adopt is that technology should not go against the principles of human rights and dignity. The concept of discrimination whether done by humans or intelligent machines is against the Universal Declaration of Human Rights. For us to enforce this principle, we should adopt an additional principle that ensures that economic interests should not supersede the principle of human rights…

The third principle is that we should embed into technology our values. A classic example to illustrate this is a self-driving car that is travelling at 120 km/h and it encounters a pedestrian. If it can possibly only do two things, should it save a pedestrian and kill a passenger or should it save the passenger and kill the pedestrian? What should this self-driving car do if the passenger is a 60-year male person and the pedestrian is a girl who is 8 years old? To answer these questions, we need to interrogate our core values and embed these values into these self-driving cars. How about if as a country we do not have the means to design these self-driving cars, how do we ensure that we embed our values into the cars that we import?

In conclusion we should ensure that these machine-learning algorithms are designed to be fair, unbiased and are driven by the principle of fairness rather than the principle of maximisation of profit.”

In this way the South African professor is teaching the world to retain its humanity while the use and capabilities of 4IR technologies are growing exponentially. It is to be hoped that the world takes notice and that new inequalities and injustices don’t develop to replace the old inequalities and injustices.


The opinions expressed above are those of the Executive Director, Jansie Niehaus, and do not necessarily reflect the views of the Executive Committee or members of the NSTF.