---
title: "Not-Lost in translation: Getting diffusion data into netdiffuseR"
author: "George G. Vega Yon"
date: "January 13, 2016"
header-includes: \usepackage{graphicx}
vignette: >
  %\VignetteIndexEntry{Not-Lost in translation: Importing and exporting graph}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
output: html_vignette
---

\tableofcontents

# Introduction

- More often than we would like, one of the biggest barriers to using R is getting the data into it.
- For this tutorial, we classify graph data as follows:
    * Raw R network data: Datasets with edgelist, attributes, survey data, etc.
    * Already R data: already read into R using igraph, statnet, etc.
    * Graph files: DL, UCINET, pajek, etc.
- The __netdiffuseR__ package includes several options to read such data.
- This tutorial shows how to use those options (functions) so that the user generates `diffnet` objects (the core of netdiffuseR).
- Importantly, while very useful, `diffnet` objects are not the only way to use __netdiffuseR__. Most of the functions can also be used with matrices and arrays.

```{r Loading netdiffuseR, echo=FALSE}
library(netdiffuseR)
knitr::opts_chunk$set(comment = '#')
```

# Raw network data

- By raw network data we mean datasets in a somewhat raw form (for example, edgelists, adjacency matrices, or survey nomination data) that need to be read into R.
- Usually these datasets are accompanied by vertex attribute data.
- The issue is how to read it all into R and handle it together.
- Before starting, we recommend that the user take a look at the _Data input_ functions included in the `utils` package (see `?read.table`), at the functions included in the `foreign` package (useful for reading from Stata, SPSS, etc.), and at the `read_excel` function in the `readxl` package[^excelfiles] for reading Excel files into R.

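As a minimal sketch of the import step recommended above, the following reads an edgelist and an attribute table with `read.csv` from the `utils` package. The inline text stands in for files on disk, and the column names (`ego`, `alter`, `value`, `id`, `toa`) and all values are made up for illustration.

```r
# Hypothetical raw data: in practice these would be files on disk, read
# with something like read.csv("edgelist.csv"). Inline text is used here
# so that the example is self-contained.
edgelist_txt <- "ego,alter,value
1,2,1
2,3,2
3,1,1"

attrs_txt <- "id,toa
1,2001
2,2003
3,2002"

raw_edgelist <- read.csv(text = edgelist_txt)
raw_attrs    <- read.csv(text = attrs_txt)

# Checking column names and types before going any further
str(raw_edgelist)
str(raw_attrs)
```

Inspecting the result with `str` right after reading is a cheap way to catch ids that were read as character or factor instead of integer.
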
[^excelfiles]: While there are other candidates, such as the `openxlsx` package, the `readxl` package has the nice feature of correctly processing the encoding of Excel files. This is especially important if you are dealing with non-ASCII or UTF-8 datasets.

## Edge lists

- In this section we review the function `edgelist_to_adjmat`.
- In general terms, an edgelist is a two column matrix or data frame in which each row represents a tie/link from one node to another.
- While the basic structure of edgelists only requires having two columns (usually named ego and alter), edgelists can have more information such as intensity/weight/value of the tie, or a spell/timestamp indicating the time range during which the tie exists.
- The function `edgelist_to_adjmat` supports both weights and spells.

### Example 1: Loading a static diffusion network via `as_diffnet`

For this example we will use the `fakesurvey` and `fakeEdgelist` datasets. The latter was generated using the `fakesurvey` dataset, which holds survey information retrieved from 10 different individuals in two different groups. Ties in the `fakeEdgelist` dataset are valued, and the value of each tie coincides with the number of nominations that one individual made of the other in the survey.

```{r Ex1: Datasets}
# Loading the datasets
data("fakesurvey")
data("fakeEdgelist")
```

Taking a look at `fakesurvey`'s `group` and `id` columns and `fakeEdgelist`'s `ego` and `alter` columns, the user can tell that the latter were generated by adding `group*100` to `id`.

```{r}
head(fakesurvey[,c("id", "group")])
head(fakeEdgelist)
```

We will use this information later on to verify the way the data is sorted in the resulting `diffnet` objects.

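The id construction described above can be sketched on a small made-up table (not the packaged `fakesurvey` data; the object name `toy_survey` and its values are purely illustrative). Ids that repeat across groups become unique labels once combined as `group*100 + id`.

```r
# Toy attribute table (made-up values, not the packaged fakesurvey data)
toy_survey <- data.frame(
  id    = c(1, 2, 1, 2),
  group = c(1, 1, 2, 2)
)

# Within-group ids repeat across groups...
anyDuplicated(toy_survey$id) > 0

# ...but group*100 + id gives one unique label per individual
toy_survey$label <- with(toy_survey, group * 100 + id)
toy_survey$label
```

This scheme only yields unique labels as long as no within-group id reaches 100.
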
To use the `as_diffnet` function we need at least two objects: a __dynamic graph__ represented as either an array or a list of adjacency matrices, each of size $n\times n$, which in our case will be $10\times 10$, and an integer vector of size $n=10$ which holds each vertex's __time of adoption__. Let's start by generating the dynamic graph using the `edgelist_to_adjmat` function:

```{r}
# Coercing the edgelist to an adjacency matrix
adjmat <- edgelist_to_adjmat(
  edgelist   = fakeEdgelist[,1:2], # Should be a two column matrix/data.frame
  w          = fakeEdgelist$value, # An optional vector with weights
  undirected = FALSE,              # In this case, the edgelist is directed
  t          = 5)                  # We use this option to make 5 replicas of it
```

As the function warns, one edge, edge 11, had incomplete information and therefore was not used to create the adjacency matrix. If we take a look at that edge, we will see that it indeed had incomplete information on the weight attribute:

```{r}
fakeEdgelist[11,,drop=FALSE]
```

To address this, if we want to keep vertex 202, an isolated vertex in the data, we need to fill in that value. Otherwise, when creating the diffnet object, we would end up with more attributes or times of adoption than vertices in the graph.

```{r}
# Filling the empty data, and checking the outcome
fakeEdgelist[11,"value"] <- 1
fakeEdgelist[11,,drop=FALSE]

# Coercing the edgelist to an adjacency matrix (again)
adjmat <- edgelist_to_adjmat(
  edgelist      = fakeEdgelist[,1:2], # Should be a two column matrix/data.frame
  w             = fakeEdgelist$value, # An optional vector with weights
  undirected    = FALSE,              # In this case, the edgelist is directed
  keep.isolates = TRUE,               # NOTICE THIS NEW ARGUMENT!
  t             = 5)                  # We use this option to make 5 replicas of it
```

As expected, there is no warning. Furthermore, we have told the function to keep isolated vertices if there are any, as is the case with edge \#11, which holds vertex 202.

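To make the conversion above more concrete, here is a base R sketch of how a weighted, directed edgelist maps to an adjacency matrix while keeping an isolated vertex. It is not the implementation used by `edgelist_to_adjmat` (which returns sparse matrices); the toy data, the object names, and the choice of encoding the isolate as a row with a missing alter are assumptions made for illustration.

```r
# Toy weighted, directed edgelist (made-up values). The last row encodes
# an isolated vertex as an ego with a missing alter: an assumption made
# for this sketch only.
toy_edges <- data.frame(
  ego   = c(101, 102, 103, 202),
  alter = c(102, 103, 101, NA),
  value = c(1, 2, 1, 1)
)

# All vertex labels, isolates included
ids <- sort(unique(na.omit(c(toy_edges$ego, toy_edges$alter))))

# Empty n x n matrix with the vertex labels as dimnames
toy_adjmat <- matrix(
  0, nrow = length(ids), ncol = length(ids),
  dimnames = list(as.character(ids), as.character(ids))
)

# Filling in the weights of the complete ties only (row = ego, col = alter)
complete <- !is.na(toy_edges$alter)
idx <- cbind(
  match(toy_edges$ego[complete], ids),
  match(toy_edges$alter[complete], ids)
)
toy_adjmat[idx] <- toy_edges$value[complete]

toy_adjmat
```

The isolated vertex shows up as a row and a column of zeros, so the matrix has one row per individual in the attribute data.
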
Since we asked the function to create 5 copies of the adjacency matrix, we have a list of 5 adjacency matrices. Let's take a look at the first element of this list:

```{r}
adjmat[[1]]
```

As you can see, the `edgelist_to_adjmat` function kept the vertex labels and included them as dimnames in the matrix.[^sparsemat] Now that our adjacency matrix has the number of elements that we expected, which coincides with the number of rows in the `fakesurvey` dataset, we can create a `diffnet` object:

[^sparsemat]: Note also that the matrices stored in `adjmat` are of class `dgCMatrix` from the `Matrix` package. These are sparse matrices stored in compressed column format, which saves memory in matrices with many zeros. __netdiffuseR__ routines are based on this class of matrices. To give an idea of how much memory sparse matrices save: while a square matrix of size $5e4\times 5e4$ would need close to 18GB of memory using a regular R `matrix`, a `dgCMatrix` of the same size takes around 6MB.

```{r}
# Coercing the adjacency matrix and edgelist into a diffnet object
diffnet <- as_diffnet(
  graph               = adjmat,         # Passing a dynamic graph
  toa                 = fakesurvey$toa, # This is required
  vertex.static.attrs = fakesurvey      # This is optional
)

# Taking a look at the diffnet object
diffnet
```

### Example 2: Loading a static diffusion network via `edgelist_to_diffnet`

Following the previous example, instead of "manually" generating the adjacency matrix and calling the `as_diffnet` function, we will use the `edgelist_to_diffnet` function. The most important issue when calling this routine is to have matching ids between the edgelist and the attributes dataset.
So before calling the `edgelist_to_diffnet` function we need to fix the `id` column in the `fakesurvey` dataset:[^withfunction]

[^withfunction]: The `with` function simplifies data management in R by letting you reference columns in a data.frame without having to name the data.frame itself (see `?with`).

```{r}
# Before
fakesurvey$id

# Changing the id
fakesurvey$id <- with(fakesurvey, group*100 + id)

# After
fakesurvey$id
```

Now that it is fixed, we can call the `edgelist_to_diffnet` function:

```{r}
diffnet2 <- edgelist_to_diffnet(
  edgelist      = fakeEdgelist[,1:2], # Passed to edgelist_to_adjmat
  w             = fakeEdgelist$value, # Passed to edgelist_to_adjmat
  dat           = fakesurvey,         # Data frame with -idvar- and -toavar-
  idvar         = "id",               # Name of the -idvar- in -dat-
  toavar        = "toa",              # Name of the -toavar- in -dat-
  keep.isolates = TRUE                # Passed to edgelist_to_adjmat
)
diffnet2
```

Unlike in the previous example, here the algorithm makes sure that the ordering of the dataset and the vertices in the adjacency matrix coincide. The previous example did give us a correctly sorted `diffnet` object, but that may not always be the case. Nevertheless, the option `id.and.per.vars` allows the user to provide the names of the variables in the vertex attribute datasets that hold the ids and time period ids of each observation, so that the function sorts the data before coercing it into diffnet objects. More on this in the following examples.

## Survey data: Nomination networks

- In this part we review the function `survey_to_diffnet`.
- This function can take as input either a longitudinal dataset (which should be in long format, that is, one row per individual and time period) or a cross sectional dataset.
- For this example we will use both `fakesurvey`, which holds cross section data, and `fakesurveyDyn`, which holds longitudinal data.

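To make the long format mentioned above concrete, the following sketch builds a made-up longitudinal dataset (not the packaged `fakesurveyDyn`; the object and column names are illustrative) with one row per individual and time period, and then takes a single-period slice with one row per individual, which is the shape of a cross sectional dataset.

```r
# Toy longitudinal data in long format (made-up values): one row per
# individual (id) and time period (time)
toy_long <- data.frame(
  id   = rep(1:3, each = 2),
  time = rep(c(1990, 1991), times = 3),
  net1 = c(2, 2, 3, 1, 1, NA) # id of the person nominated in each period
)
toy_long

# A cross sectional dataset has a single row per individual instead
toy_cross <- toy_long[toy_long$time == 1990, c("id", "net1")]
toy_cross
```
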
### Example 3: Cross section diffusion data

We start by taking a look at the data:

```{r}
# Loading the data
data("fakesurvey")
fakesurvey
```

A couple of important remarks about this dataset. First, the individuals in this dataset belong to different groups. While this is not always the case, `survey_to_diffnet` allows accounting for it through the `groupvar` argument. Also, besides having an isolated vertex, two individuals in the survey nominate people who were neither surveyed nor show up in their groups:

```{r}
fakesurvey[c(4,6),]
```

So in group one, 4 nominates id 6, who does not show up in the data, and in group two, 6 nominates 3, 4, and 8, individuals who also don't show up in the survey. While for some researchers nominations of unsurveyed individuals may not be of importance, for others they might be. For such cases, the function has the option of either keeping unsurveyed individuals (so you would get a bigger adjacency matrix), or ignoring them and keeping only those who were surveyed. For example, if we wanted to keep unsurveyed individuals in the network we would need to set `no.unsurveyed = FALSE`:

```{r}
# Coercing the survey data into a diffnet object
diffnet_w_unsurveyed <- survey_to_diffnet(
  dat           = fakesurvey,                # The dataset
  idvar         = "id",                      # Name of the idvar (must be integer)
  netvars       = c("net1", "net2", "net3"), # Vector of names of nomination vars
  toavar        = "toa",                     # Name of the time of adoption var
  groupvar      = "group",                   # Name of the group var (OPTIONAL)
  no.unsurveyed = FALSE                      # KEEP OR NOT UNSURVEYED
)
diffnet_w_unsurveyed

# Retrieving nodes ids
nodes(diffnet_w_unsurveyed)
```

The result is a network spanning 5 time periods with 13 vertices (9 surveyed individuals + 4 unsurveyed individuals).
This produces a different result from the one we get when we use the default behavior of the function, `no.unsurveyed = TRUE`:

```{r}
# Coercing the survey data into a diffnet object
diffnet_wo_unsurveyed <- survey_to_diffnet(
  dat      = fakesurvey,                # The dataset
  idvar    = "id",                      # Name of the idvar (must be integer)
  netvars  = c("net1", "net2", "net3"), # Vector of names of nomination vars
  toavar   = "toa",                     # Name of the time of adoption var
  groupvar = "group"                    # Name of the group var (OPTIONAL)
)
diffnet_wo_unsurveyed

# Retrieving nodes ids
nodes(diffnet_wo_unsurveyed)
```

Furthermore, we can compare the two diffusion networks by subtracting one from the other:

```{r}
difference <- diffnet_w_unsurveyed - diffnet_wo_unsurveyed
difference
```

### Example 4: Longitudinal diffusion data

In this example we will use dynamic network data, that is, an edgelist with spells and dynamic attributes:

```{r}
# Taking a look at the data
data("fakeDynEdgelist")
head(fakeDynEdgelist)
```

```{r}
data("fakesurveyDyn")
head(fakesurveyDyn)
```

As before, we have to make sure the ids are right:

```{r}
# Fixing ids
fakesurveyDyn$id <- with(fakesurveyDyn, group*100 + id)

# An individual who is alone
fakeDynEdgelist[11,"value"] <- 1
```

```{r}
diffnet <- edgelist_to_diffnet(
  edgelist      = fakeDynEdgelist[,1:2], # As usual, a two column dataset
  w             = fakeDynEdgelist$value, # Here we are using weights
  t0            = fakeDynEdgelist$time,  # An integer vector with starting point of spell
  t1            = fakeDynEdgelist$time,  # An integer vector with the endpoint of spell
  dat           = fakesurveyDyn,         # Attributes dataset
  idvar         = "id",
  toavar        = "toa",
  timevar       = "time",
  keep.isolates = TRUE                   # Keeping isolates (if there are any)
)
diffnet
```