What is data?

Data is the currency of Science. But what is data? Data is another word for results, facts and information. Data helps not only scientists: any learning individual relies on data to generate “knowledge”. Whether it is a baby learning to speak, a toddler learning by trial and error that sharp things sting, or Facebook or Google generating recommendations for you based on your screen time report, data is the raw material to be analyzed.

If you have ever read a Sherlock Holmes book or watched one of the movies or series, you will remember that he gathers data from all his observations, makes really strong deductions and leaves everybody speechless while solving murders and other crimes. This is the way his mind works and, for entertainment purposes, it is very interesting to watch. This makes me think: what do we do with all the data that is available to us? Do we use it as precisely as Sherlock does, or do we perceive it as a lot of noise?

What do we do with data?

If I say: “I really like winter”, and you say: “My mom really likes winter”, am I your mother? This kind of logical reasoning is called a syllogism (deductive reasoning that draws a conclusion from given propositions; the conclusion may or may not hold true). With the data that I have provided, it may sound plausible, huh? I am pretty sure I am not your mother 😉. Sharing one trait with someone does not make you that person. The mere fact of having data does not mean that we can analyze it and conclude whatever makes the most sense to us. When we draw conclusions like the one above, we are committing a fallacy (fallacies can occur in syllogisms; if you want to learn more about fallacies, please click here). Fallacies are the enemy of Science! Now, check out the following graph:

Generated from: http://tylervigen.com/view_correlation?id=906

If we plot the divorce rate in Texas and the per-person consumption of margarine in the US between 2000 and 2009, we see that both go down as time passes. This is data: a lot of information. Are people staying married because margarine intake in the whole country decreased? Or, even worse, what if people are staying married and this made the whole country eat less margarine? When two variables seem to change together, in the same or in opposite directions, we are observing a correlation (if you want to learn more about correlations, please click here). A correlation on its own does not tell us that one thing causes the other.
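To make the idea of a correlation concrete, here is a minimal sketch in R. The numbers are invented for illustration and are not the real values behind the graph; the names divorce_rate and margarine_lbs are also just placeholders. The only point is that two series that both decline over time give a correlation coefficient close to 1, whether or not one has anything to do with the other.

```r
# Made-up values for illustration only (one value per year, 2000 to 2009).
# These are NOT the real divorce or margarine figures.
divorce_rate  <- c(5.0, 4.9, 4.8, 4.7, 4.6, 4.5, 4.4, 4.3, 4.2, 4.1)
margarine_lbs <- c(8.0, 7.6, 7.1, 6.8, 6.2, 5.9, 5.3, 5.0, 4.4, 4.1)

# Pearson correlation coefficient between the two series
correlation <- cor(divorce_rate, margarine_lbs)
print(correlation)
```

A value near 1 here only says that the two series move together. It says nothing about why they do.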

Interesting: Hypothetically, to address whether margarine really provokes divorce, we would need to set up a somewhat more complex experiment. We would need to recruit a large number of married couples as a sample that represents the entire population of interest (in age, socioeconomic status, dietary pattern, etc.) and expose them to margarine. We would also need a control group of married couples who do not receive any margarine at all, and we would follow both groups over a long period of time (let us say 20 years). Then, with the data obtained from this study, we would get an indication of causation. Of course, couples may get divorced for many other reasons, so we need to take that into account when setting up the study and analyzing the data. What do you think? Do you agree? Please let me know in the comments section!

It may not be feasible, but this would be one approach to assessing causation. As you can see, it is very important for data to be correctly analyzed and interpreted. If you are still curious about the interpretation of scientific data, please read the latest post by @erikapinheiromachado.
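The hypothetical margarine experiment can be sketched as a small simulation in R. Everything here is an assumption made for illustration: the number of couples, the 30% divorce probability and the variable names are invented, and the outcome is generated with the same probability in both groups, so margarine has no effect by construction.

```r
set.seed(1)  # makes the random assignment repeatable

n_couples <- 1000

# Randomly assign half of the hypothetical couples to each group
group <- sample(rep(c("margarine", "control"), each = n_couples / 2))

# Simulated outcome (1 = divorced, 0 = still married), generated with the
# SAME assumed probability in both groups: no margarine effect by design
divorced <- rbinom(n_couples, size = 1, prob = 0.3)

# Divorce rate per group
rates <- tapply(divorced, group, mean)
print(rates)
```

In a real study, the two rates would be compared with a statistical test, and the other reasons for divorce mentioned above would need to be measured and accounted for.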

The way we analyze data has evolved: Programming languages

Different types of data have arisen, such as the whole human genome, the expression levels of all genes in humans and many other species, demographic data, data about the COVID-19 pandemic, clinical trials, and so on. We are no longer merely plotting graphs with two variables and seeing how they look. We have scaled up our analyses to a whole new level. Take this example:

People talk about supercomputers, and yes, they do exist. However, most of these analyses can be performed on our everyday computers. This is because of two things: first, the computing power of our everyday devices has improved, and second, some heavy analyses can be performed remotely, with our computers handling the code and the computation occurring on another server. One of the most common tools for this kind of higher-order analysis is a programming language. A programming language is, as its name says, a language. It possesses vocabulary, grammar, and syntax. Much of the statistical analysis of so-called Big Data is performed in a programming language called R.

Disclaimer: There are many other programming languages, such as Python, Java, C++, etcetera. From this point on we will focus on R, which is the one I have the most contact with. I am not an expert, but this post aims to introduce it as a resource for whoever is curious about it and wants to learn more.

In fact, some of the data generated in my PhD project have to be analyzed with R. While a lot of people say that R is one of the easiest programming languages, I have to admit that, like any language, it is hard at the beginning and you never finish learning it. Outside of academic research, big institutions and companies use R to assess trends, make predictions based on the data they gather, and support data-driven decision making on important issues. You can read more about R applications here.

R as a data analysis tool has been the subject of a feature article in Nature magazine!
You can read about it here.

I have been thinking about how important data science is becoming in current academic research. Some months ago I took a couple of courses to get comfortable with R. The courses were taught by Theo van Mourik, MSc, from the Center for Information Technology (CIT) at the University of Groningen. I approached him with some questions that might clear up the doubts about R that anybody could have.

The following is an excerpt/transcript from a conversation with Theo about the aspects of learning a programming language.

Interviewing Theo.

What is your relationship with coding?

I am a psychologist by training, and initially I worked with Excel. Six years ago I got involved in programming. I started with SPSS (a more specialized statistical analysis program) and realized that SPSS was not getting more popular, while the use of R was increasing. I started at CIT one year ago (although I’m not sure if that is relevant). I was going to teach an SPSS course (but never did), because fewer and fewer people enrolled in it. In the meantime R was on the rise, so I learned it and started teaching that.

What kind of things can R do that Excel cannot do?

R is more adaptable, since it is a programming language. If you can think of what you want, you can do it in R. Of course, at the beginning it is going to take you more time to create a “graph” than a couple of clicks in Excel or SPSS. But an advantage is that once you write your piece of code, it is reproducible. So you can run your code with multiple datasets and it will always behave the same. In Excel or SPSS you can do many things, but they are not as adaptable as R.
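To illustrate the point about reproducibility, here is a minimal sketch (my own, not Theo's code). The function name summarise_scores, the column name score and both datasets are invented for illustration. The idea is simply that the same piece of code can be run on several datasets and behaves the same each time.

```r
# Hypothetical helper: summarise one numeric column of any data frame
summarise_scores <- function(df, column = "score") {
  values <- df[[column]]
  c(n = length(values), mean = mean(values), sd = sd(values))
}

# Two made-up datasets with the same structure
group_a <- data.frame(score = c(4, 5, 6, 7, 8))
group_b <- data.frame(score = c(10, 12, 14))

# The same code runs unchanged on both datasets
result_a <- summarise_scores(group_a)
result_b <- summarise_scores(group_b)
print(result_a)
print(result_b)
```

Running the script again on the same data gives the same result, which is harder to guarantee with a series of manual clicks.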

Another good thing about R is that you can download packages of functions to work with. These R packages are easy to download and are continuously updated by developers and users like you and me. The R packages have improved over time; they only get better!

Was anything particularly hard when you were learning R?

I compare learning R to learning Polish. It is not necessarily hard, but you need to memorize a lot of vocabulary and grammar rules before you can start to speak. At first, the “hardest” thing is that you need to learn a lot of words. So for the vocabulary, there is a steep learning curve. And also, when you start in R, you work with data frames, which are basically just tables. Why do we need to keep the jargon so complicated? It is just a table! But people use the term data frame, so we have to learn it. Another thing is that you have to be very specific about what you want R to do, otherwise commands will not run. Look at this comparison:

English: How are you Tom?
Polish:  jak się masz Tom?
R:       Feeling <- Tom[Tom$day == Sys.Date(), "mood"]
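For readers who want to try the R line themselves, here is a minimal runnable sketch. The data frame Tom and its values are made up, and Sys.Date() is used to get today's date as a Date object; treat it as one possible way to express the idea, not the only one.

```r
# Today's date as a Date object
today <- Sys.Date()

# Hypothetical data frame (a table): one row per day, with a made-up mood
Tom <- data.frame(
  day  = c(today - 1, today),
  mood = c("tired", "happy"),
  stringsAsFactors = FALSE
)

# Keep the rows where day is today, and take the "mood" column
Feeling <- Tom[Tom$day == today, "mood"]
print(Feeling)
```

The part before the comma inside the square brackets selects rows, and the part after the comma selects columns. This is the kind of precision R asks for.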

How long did it take you to learn R?

I would say that it took me 2 weeks “full-time”. Of course I could not focus on it full time, because I had other commitments, but I would say that with full dedication you can achieve decent outcomes in around 2 weeks or less.

What do I need to do if I want to learn a programming language, such as R? 

This may not sound impartial because I teach the course on R in the CIT Academy, but starting with a well-structured course is always a good thing. You can always watch YouTube tutorials and videos, but we focus on a teaching style that leads to learning. And if you think about it psychologically, when you enroll in a course, you set your mind to dedicate time and attention to that task. Also, if you have a question about programming, you can send us an e-mail at citservicedesk@rug.nl and we can support you with any question that comes to us. This is an important aspect when you start learning: you need a coach or someone who can go through your questions and help you learn.

Are these online courses open to people outside of the University of Groningen? Let us say I share this post with people in Mexico: would they be able to join the R course?

These courses are open to anyone: students, employees and external attendees. They can always enroll. I would really like to know what people outside of this University do with their data analyses. Please share this info with them! I think we can learn a lot from them.

How can I, and other people who already have some experience with R, ask for help with specific packages or applications (biology, psychology, statistics, epidemiology)?

We are setting up an online R-‘community’, where we can form a database of R users and their fields of expertise, so that if we receive a very specialized question, for example about gene sequencing in biology, we can connect you with an expert or user who has done it and may be able to help you. We want to set up some functions and activities for the R-‘community’ such as:

  • Helpdesk function: just send us your questions or your code by e-mail. – Already running
  • Courses and coaching on R – Already running
  • Creating R scripts for you (tailored coding) – Already running
  • Network to redirect for specialized questions
  • Network to learn of packages that are gaining popularity (trending)
  • Perhaps in the future we can send a newsletter with tips, tricks and updates on R

Just send a message to citservicedesk@rug.nl and we can start from there!

From the SciFact team: A big thank you to Theo van Mourik for the time and help provided!
