Here comes the sun.
Well, I am happy with the results, but not entirely happy. I know I could had reached more if I was not stuck on only one task.
The results are: I extracted the data from this website http://data.worldbank.org/, regarding GDP and Labor Force.
My challenge was cleaning the data and make it readable in any BI tool or Excel spreadsheet.
See the code I used to achieve my goal. I wrote some comments on it to make it easier to understand.
GDP and Labor Force
# read the file and create a data frame from it
GDP <- read.delim("C:/Users/xxxxxx/Desktop/R_Projects/Open_Data/GDP_1960-2014.csv", header=TRUE)
Labor_Force <- read.delim("C:/Users/xxxx/Desktop/R_Projects/Open_Data/Labor_Force_1960-2014_.csv", header=TRUE)
# filter only the columns you need
GDP_filt <- GDP[,c(1,2,49,50,51,52,53,54,55,56,57,58)]
# filter the rows which contain the data you want
GDP_LA <- subset(GDP_filt, Country.Name == "Brazil" | Country.Name == "Argentina" | Country.Name == "Chile")
# use this package to call a function to transpose your data frame
install.packages("reshape2")
library(reshape2)
GDP_transp <- melt(GDP_LA, id = c("Country.Name","Country.Code"))
# remove "X" character from column "Variable"
GDP_transp$variable <- gsub("X", "", GDP_transp$variable)
# rename the columns
colnames(GDP_transp)[3] <- "Year"
colnames(GDP_transp)[4] <- "GDP.Value"
# create an ID to merge the tables
GDP_transp$ID <- paste(GDP_transp$Country.Code, GDP_transp$Year, sep="")
# move columns to the position you want
GDP_transp <- GDP_transp[,c(5,2,1,3,4)]
## Now, let's do all steps again with the other file
LF_filt <- Labor_Force[,c(1,2,49,50,51,52,53,54,55,56,57,58,59)]
LF_LA <-subset(LF_filt, Country.Name == "Brazil" | Country.Name == "Argentina" | Country.Name == "Chile")
LF_transp <- melt(LF_LA, id = c("Country.Name","Country.Code"))
LF_transp$variable <- gsub("X", "", LF_transp$variable)
colnames(LF_transp)[3] <- "Year"
colnames(LF_transp)[4] <- "Labor.Value"
LF_transp$ID <- paste(LF_transp$Country.Code, LF_transp$Year, sep="")
LF_transp <- LF_transp[,c(5,2,1,3,4)]
# merge the data frames by ID as Primary Key
x <- merge(LF_transp, GDP_transp, by = "ID")
# rename all columns
colnames(x)[2] <- "Country.Code"
colnames(x)[3] <- "Country.Name"
colnames(x)[4] <- "Year"
colnames(x)[5] <- "LaborF.Value"
colnames(x)[6] <- "GDP.Value"
# create subsets to individually run the regression
x_ARG <- subset(x, Country.Code == "ARG")
x_CHL <- subset(x, Country.Code == "CHL")
x_BRA <- subset(x, Country.Code == "BRA")
# run the regression
model_ARG <- lm(x_ARG$GDP.Value ~ x_ARG$LaborF.Value)
model_CHL <- lm(x_CHL$GDP.Value ~ x_CHL$LaborF.Value)
model_BRA <- lm(x_BRA$GDP.Value ~ x_BRA$LaborF.Value)
And the results are: All the models are statistically significant. It means that approximately 90% of increasing in GDP follows by rising in Labor Force.
If I am saying something wrong, let me know by leaving a comment below. Thank you.
> summary(model_BRA)
Call:
lm(formula = x_BRA$GDP.Value ~ x_BRA$LaborF.Value)
Residuals:
Min 1Q Median 3Q Max
-2.746e+11 -1.209e+11 -2.519e+10 1.029e+11 3.511e+11
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.158e+13 1.272e+12 -9.103 1.70e-05 ***
x_BRA$LaborF.Value 1.315e+05 1.258e+04 10.455 6.08e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.902e+11 on 8 degrees of freedom
Multiple R-squared: 0.9318, Adjusted R-squared: 0.9233
F-statistic: 109.3 on 1 and 8 DF, p-value: 6.084e-06
--------------------------------------------------------------------------------
> summary(model_CHL)
Call:
lm(formula = x_CHL$GDP.Value ~ x_CHL$LaborF.Value)
Residuals:
Min 1Q Median 3Q Max
-1.701e+10 -7.074e+09 -6.873e+07 4.176e+09 1.610e+10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.139e+11 3.738e+10 -11.07 3.95e-06 ***
x_CHL$LaborF.Value 7.987e+04 4.912e+03 16.26 2.06e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.082e+10 on 8 degrees of freedom
Multiple R-squared: 0.9706, Adjusted R-squared: 0.967
F-statistic: 264.4 on 1 and 8 DF, p-value: 2.059e-07
--------------------------------------------------------------------------------
> summary(model_ARG)
Call:
lm(formula = x_ARG$GDP.Value ~ x_ARG$LaborF.Value)
Residuals:
Min 1Q Median 3Q Max
-4.971e+10 -2.701e+10 4.855e+09 2.240e+10 4.980e+10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.741e+12 3.945e+11 -12.02 2.12e-06 ***
x_ARG$LaborF.Value 2.799e+05 2.146e+04 13.04 1.13e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.519e+10 on 8 degrees of freedom
Multiple R-squared: 0.9551, Adjusted R-squared: 0.9495
F-statistic: 170.1 on 1 and 8 DF, p-value: 1.134e-06
See you tomorrow.
P.S.: This website is excellent to see any information about R - http://www.r-statistics.com/