So I was thinking: what kind of companies are people starting now? They must be 'web'-based companies, right?
In Silicon Valley, startups are popping up like mushrooms after a spring rain. Truth be told, I am thinking of starting something... Who isn't? Even though everyone knows that maybe fewer than one in a thousand startups eventually succeeds, that cruel fact doesn't stop anyone from trying (or thinking about it, in my case).
Naturally, I want to see some data and get some idea of what's happening. Crunchbase is the ideal place to gather some data and have some fun.
After scraping Crunchbase for company names and basic information like start date, I made a very rough bar chart showing what kinds of companies are out there. The chart is in no way pretty, and it is missing some labels, but it is sufficient to tell my story. Forgive me for not spending the time to make it beautiful (as I should have).
(count of companies by category)
Crunchbase returned 118k or so companies. About 40% of those were not properly categorized (cat=='other' or null). But roughly speaking, there are a lot more web/software companies than companies in any other category. Oddly, there is a very small number of 'search' companies. Is that because Google/Bing are already the dominant players, leaving no room for new opportunities in search?
I guess 'web' and 'software' companies are the most popular. Keep in mind that I don't know how Crunchbase categorizes companies, but that's beside the point since we get the idea already. One thing I am scratching my head over is that 'big data' wasn't even a category... weird!
Another question I have is when these companies were started. To get that, we have to loop over API calls to fetch the details for every company. Since I have no intention of making 118k calls, I just randomly pulled the start date for about 2% of the web companies. A quick and dirty bar chart is enclosed here. Interestingly, most of them were started in 2010. Maybe Crunchbase hasn't got wind of all the companies started in 2011 and 2012? But one can see a clear rising trend since 2005.
(web companies by starting year)
If interested, check out my sample R code, which uses the Crunchbase API:
library(RCurl)
library(rjson)
url <-getURL('http://api.crunchbase.com/v/1/company/facebook.js?api_key=mwfrwmfswv9wk8z6tcb3rbxd')
document <- fromJSON(url, method='C')
facebook=unlist(document)
x=document[1:12]
x1=rbind(sapply(x,unlist))
x2=sapply(x,unlist)
x3=sapply(x,unlist)
xx=rbind(x2,x3)
#
#getting a list of companies
#
company=file("c:\\users\\yun.liu\\downloads\\companies.json.js")
comp=fromJSON(file=company, method='C')
y=do.call('rbind',comp)
name=sapply(y[,1],unlist)
permlink=sapply(y[,2],unlist)
cate=sapply(y[,3],unlist)
web=sapply(y[y[,3]=='web',2],unlist)
yy=data.frame(cbind(name,permlink,cate))
xx=do.call('rbind', sapply(cate,unlist))
tt=table(xx)
barplot(sort(tt),las=2)
tt['other']=tt['other']+length(cate)-sum(tt)
barplot(sort(tt),las=2)
#### library
if(!require(RCurl)) {
  update.packages(repos=c("http://cran.cnr.Berkeley.edu", "http://www.stats.ox.ac.uk/pub/RWin"), ask=FALSE)
  install.packages("RCurl", repos=c("http://cran.cnr.Berkeley.edu", "http://www.stats.ox.ac.uk/pub/RWin"), dependencies=TRUE)
  require(RCurl)
}
if(!require(rjson)) {
  update.packages(repos=c("http://cran.cnr.Berkeley.edu", "http://www.stats.ox.ac.uk/pub/RWin"), ask=FALSE)
  install.packages("rjson", repos=c("http://cran.cnr.Berkeley.edu", "http://www.stats.ox.ac.uk/pub/RWin"), dependencies=TRUE)
  require(rjson)
}
#### for each company
#### getting the details based on permalink
ScrapeTechcrunchie <- function(permalink)
{
  # fetch the company's JSON record from the Crunchbase API
  url <- getURL(paste('http://api.crunchbase.com/v/1/company/', permalink, '.js?api_key=mwfrwmfswv9wk8z6tcb3rbxd', sep=""), followlocation=TRUE)
  #if(nchar(url)<200 & grep('redirected',url)>0 )
  #{
  #url <-getURL(paste('http://ec2-107-21-104-179.compute-1.amazonaws.com/v/1/company/',permalink, '.js',sep=""))
  #}
  # parse the JSON text, skipping unexpected escape sequences
  document <- fromJSON(url, method='C', unexpected.escape = "skip")
  #myt<-Sys.time()
  # return only the first 12 fields of the record
  document[1:12]
}
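RunScrape <- function(mylist)
{
  #mylist=permlink[1:12]
  out <- NULL
  # seq_along() also behaves correctly when mylist is empty
  for(i in seq_along(mylist))
  {
    # try() lets the loop continue when a single call fails
    try(
    {
      tmp <- ScrapeTechcrunchie(mylist[i])
      out <- rbind(out, tmp)
      print(paste("i=", i, " company=", mylist[i], sep=""))
      # pause between calls to go easy on the API
      Sys.sleep(0.5)
    }, silent=TRUE)
  }
  print("Scraping done!")
  out
}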
# keep each web permalink with probability 0.02 (about 2% of them)
sampl=(runif(length(web),0,1)<0.02)
web1=web[sampl]
ttt=RunScrape(web1)
table(unlist(ttt[,10]))
barplot(table(unlist(ttt[,10])))
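If you want to try the sampling and counting steps without hitting the API at all, here is a minimal offline sketch. Everything in it is an assumption made for illustration: the permalinks and founding years are made up (not Crunchbase data), and `SamplePermalinks` and `CountByYear` are hypothetical helper names that do not appear in the script above. It also draws a fixed-size random sample with `sample()`, which is a swap for the `runif()` filter I used.

```r
# Hypothetical helper: draw roughly `frac` of the permalinks at random.
SamplePermalinks <- function(permalinks, frac = 0.02) {
  n <- round(frac * length(permalinks))
  sample(permalinks, size = n)
}

# Hypothetical helper: count companies per founding year, dropping missing years.
CountByYear <- function(years) {
  table(years[!is.na(years)])
}

set.seed(1)                                   # make the random draw repeatable
toy_permalinks <- paste0("company-", 1:500)   # made-up permalinks
picked <- SamplePermalinks(toy_permalinks)    # about 2% of them

# made-up founding years, including one missing value
toy_years <- c(2005, 2007, 2007, 2010, 2010, 2010, NA, 2011, 2010, 2009)
counts <- CountByYear(toy_years)
print(counts)
```

Passing `counts` to `barplot(counts, las=2)` gives the same kind of quick and dirty chart as above. A fixed-size sample makes the number of API calls predictable, whereas the `runif()` filter returns a slightly different number of companies on every run.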