Results

Here comes the sun.

Well, I am happy with the results, but not entirely happy. I know I could had reached more if I was not stuck on only one task.

The results are: I extracted the data from this website http://data.worldbank.org/, regarding GDP and Labor Force.

My challenge was cleaning the data and make it readable in any BI tool or Excel spreadsheet.

See the code I used to achieve my goal. I wrote some comments on it to make it easier to understand.

GDP and Labor Force

# read the file and create a data frame from it
GDP <- read.delim("C:/Users/xxxxxx/Desktop/R_Projects/Open_Data/GDP_1960-2014.csv", header=TRUE)
Labor_Force <- read.delim("C:/Users/xxxx/Desktop/R_Projects/Open_Data/Labor_Force_1960-2014_.csv", header=TRUE)

# filter only the columns you need
GDP_filt <- GDP[,c(1,2,49,50,51,52,53,54,55,56,57,58)]

# filter the rows which contain the data you want
GDP_LA <- subset(GDP_filt, Country.Name == "Brazil" | Country.Name == "Argentina" | Country.Name == "Chile")

# use this package to call a function to transpose your data frame
install.packages("reshape2")
library(reshape2)
GDP_transp <- melt(GDP_LA, id = c("Country.Name","Country.Code"))

# remove "X" character from column "Variable"
GDP_transp$variable <- gsub("X", "", GDP_transp$variable)

# rename the columns
colnames(GDP_transp)[3] <- "Year"
colnames(GDP_transp)[4] <- "GDP.Value"

# create an ID to merge the tables
GDP_transp$ID <- paste(GDP_transp$Country.Code, GDP_transp$Year, sep="")

# move columns to the position you want
GDP_transp <- GDP_transp[,c(5,2,1,3,4)]

## Now, let's do all steps again with the other file

LF_filt <- Labor_Force[,c(1,2,49,50,51,52,53,54,55,56,57,58,59)]
LF_LA <-subset(LF_filt, Country.Name == "Brazil" | Country.Name == "Argentina" | Country.Name == "Chile")
LF_transp <- melt(LF_LA, id = c("Country.Name","Country.Code"))
LF_transp$variable <- gsub("X", "", LF_transp$variable)
colnames(LF_transp)[3] <- "Year"
colnames(LF_transp)[4] <- "Labor.Value"
LF_transp$ID <- paste(LF_transp$Country.Code, LF_transp$Year, sep="")
LF_transp <- LF_transp[,c(5,2,1,3,4)]

# merge the data frames by ID as Primary Key
x <- merge(LF_transp, GDP_transp, by = "ID")

# rename all columns
colnames(x)[2] <- "Country.Code"
colnames(x)[3] <- "Country.Name"
colnames(x)[4] <- "Year"
colnames(x)[5] <- "LaborF.Value"
colnames(x)[6] <- "GDP.Value"

# create subsets to individually run the regression 
x_ARG <- subset(x, Country.Code == "ARG")
x_CHL <- subset(x, Country.Code == "CHL")
x_BRA <- subset(x, Country.Code == "BRA")

# run the regression
model_ARG <- lm(x_ARG$GDP.Value ~ x_ARG$LaborF.Value)
model_CHL <- lm(x_CHL$GDP.Value ~ x_CHL$LaborF.Value)
model_BRA <- lm(x_BRA$GDP.Value ~ x_BRA$LaborF.Value)

And the results are: All the models are statistically significant. It means that approximately 90% of increasing in GDP follows by rising in Labor Force.

If I am saying something wrong, let me know by leaving a comment below. Thank you.

> summary(model_BRA)

Call:
lm(formula = x_BRA$GDP.Value ~ x_BRA$LaborF.Value)

Residuals:
       Min         1Q     Median         3Q        Max 
-2.746e+11 -1.209e+11 -2.519e+10  1.029e+11  3.511e+11 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -1.158e+13  1.272e+12  -9.103 1.70e-05 ***
x_BRA$LaborF.Value  1.315e+05  1.258e+04  10.455 6.08e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.902e+11 on 8 degrees of freedom
Multiple R-squared:  0.9318,    Adjusted R-squared:  0.9233 
F-statistic: 109.3 on 1 and 8 DF,  p-value: 6.084e-06

--------------------------------------------------------------------------------

> summary(model_CHL)

Call:
lm(formula = x_CHL$GDP.Value ~ x_CHL$LaborF.Value)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.701e+10 -7.074e+09 -6.873e+07  4.176e+09  1.610e+10 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -4.139e+11  3.738e+10  -11.07 3.95e-06 ***
x_CHL$LaborF.Value  7.987e+04  4.912e+03   16.26 2.06e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.082e+10 on 8 degrees of freedom
Multiple R-squared:  0.9706,    Adjusted R-squared:  0.967 

F-statistic: 264.4 on 1 and 8 DF,  p-value: 2.059e-07

--------------------------------------------------------------------------------

> summary(model_ARG)

Call:
lm(formula = x_ARG$GDP.Value ~ x_ARG$LaborF.Value)

Residuals:
       Min         1Q     Median         3Q        Max 
-4.971e+10 -2.701e+10  4.855e+09  2.240e+10  4.980e+10 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -4.741e+12  3.945e+11  -12.02 2.12e-06 ***
x_ARG$LaborF.Value  2.799e+05  2.146e+04   13.04 1.13e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.519e+10 on 8 degrees of freedom
Multiple R-squared:  0.9551,    Adjusted R-squared:  0.9495 

F-statistic: 170.1 on 1 and 8 DF,  p-value: 1.134e-06

See you tomorrow.

P.S.: This website is excellent to see any information about R - http://www.r-statistics.com/

« Keep trying Networking »

Renata Cabral

Results

Here comes the sun.