
Saturday, May 11, 2013

Hacking eBay: one more 'R programming is fun!'

How about scoring an iPhone for 30% less than the average selling price on eBay?

Plenty of strategies can be found online. However, here is a fun programmatic way of doing it (thanks to my friends Frank and Paul for the inspiration).

It is possible by combining a 'saved search' with frequent refreshes of that search!

There are tons of deals on eBay if you can find them, but it takes a lot of time to land a bargain. 'Saved search' is a nice feature that can be used to filter out garbage and narrow the results to a certain price range. For example, searching 'iphone 5 black 16gb' returns a lot of garbage such as accessories or damaged phones. Advanced search provides some options to clean up the returned listings (see the screenshots at the end of the post, and make sure to put in a few exclusion tokens).

Once a good 'saved search' is created for a price range, the key is to grab the deal a few seconds after someone lists the product within that range. One could keep clicking the 'saved search' link on My eBay to refresh the results, or one could automate the process. The fun part is the automation; I suspect automating it is more fun for me than actually scoring a deal.

The URL for a saved search looks like the following:
http://www.ebay.com/sch/Cell-Phones-Accessories-/15032/i.html?_sadis=200&LH_SALE_CURRENCY=0&_from=R40&_samihi=&_fpos=&_oexkw=&LH_Time=1&_sop=12&LH_PayPal=1&_okw=&_fsct=&_ipg=50&LH_BIN=1&LH_ItemCondition=4&_samilow=&_udhi=400&_ftrt=903&_sabdhi=&_udlo=300&_ftrv=1&_sabdlo=&_adv=1&_dmd=1&_mPrRngCbx=1&_nkw=iphone+5+16gb+black+-sprint+-t-mobile+-bad+-water&saveon=2

You can see the keywords, price range, filtering criteria, etc. in the URL.
Here is a simple program to automate the process. One can keep it running every second or so, and when a deal shows up, an email alert tells you to rush to eBay and buy it. If there are no deals, the program stays quiet and sends no email.
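If you are curious which parameter controls which filter, a quick way to eyeball them is to split the query string in R. This is only a minimal sketch, assuming the URL above is pasted into a string called myurl (the same name used in the script below); _nkw carries the keywords, and _udlo/_udhi appear to be the lower and upper price bounds:

# split the saved-search URL into its query parameters for a quick look
qs     <- sub("^[^?]*\\?", "", myurl)            # keep everything after the '?'
pairs  <- strsplit(strsplit(qs, "&")[[1]], "=")  # break into name/value pairs
params <- setNames(sapply(pairs, function(p) if (length(p) > 1) URLdecode(p[2]) else ""),
                   sapply(pairs, `[`, 1))
params[c("_nkw", "_udlo", "_udhi")]              # keywords, price floor, price ceiling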

The steps are in the R code below. It can easily be scheduled through the Windows Task Scheduler. Note that I have a command-line email utility set up for sending the alert; you may not have the same setup, but you can use an alternative, as I mentioned in another post on the email script.

myurl='xxxx'   # copy the saved-search URL here
x<-paste(readLines(myurl, warn=F), collapse="\n")   # fetch the eBay search result page
# strip tags and newlines, then pull out the number of active listings
n<-as.numeric(gsub("^.*\\s+(\\d+)\\s+active listings.*?$", "\\1", gsub("\\n|<[^>]+>", " ", x, perl=T), perl=T))

if (!is.na(n) && n>0) {
 write(x, file='test.html')   # save the result page so it can be mailed as the message body
 mycmd<-paste("email.exe -s ", "found", " -html -tls -c email.gmail  xxxx@gmail.com < test.html", sep="")
 shell(mycmd)   # send the alert through the command-line email client
}
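If you would rather not use the Task Scheduler, the same check can run in a loop inside one long-lived R session. This is only a minimal sketch, assuming the snippet above is wrapped in a function with the hypothetical name check_deals():

# hypothetical wrapper around the download/parse/email logic shown above
check_deals <- function() {
  x <- paste(readLines(myurl, warn=F), collapse="\n")
  n <- as.numeric(gsub("^.*\\s+(\\d+)\\s+active listings.*?$", "\\1",
                       gsub("\\n|<[^>]+>", " ", x, perl=T), perl=T))
  if (!is.na(n) && n > 0) {
    write(x, file='test.html')
    shell("email.exe -s found -html -tls -c email.gmail xxxx@gmail.com < test.html")
  }
}

repeat {
  try(check_deals(), silent=T)   # a transient network error should not kill the loop
  Sys.sleep(1)                   # check every second, as described above
}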

Happy shopping!


Snapshot of how 'Saved Search' could be configured (screenshots omitted; the key settings are below).
The product is specified in the keywords, and the exclusion tokens go in the exclusion box.
The price range should be specified 30% below the average selling price for the item.
'Buy it Now' is the best format for this kind of deal.
Only show newly listed items.


Tuesday, May 7, 2013

Another complex sql operation example


postgresql array function

Transpose a list of ids into a flat, delimited row. The row could serve as a filter for subsequent queries:

SELECT
array_to_string(array_agg(trim(id)), ',')
FROM terms
Let's say the terms table has a list of ids:
id
1
2
3
After the query we get:
1,2,3

Here is the Oracle way of doing this, courtesy of my friend Darrin:

SELECT listagg(account_id, ',') WITHIN GROUP (ORDER BY account_id ASC)
FROM gg_account

You can also do

SELECT portfolio_id,
       listagg(account_id, ',') WITHIN GROUP (ORDER BY account_id ASC)
FROM gg_account
GROUP BY portfolio_id

Sunday, April 28, 2013

Productivity tool: automating emailed reports from a SQL job through the Windows 7 Task Scheduler

'My middle name is automation,' I sometimes tell my colleagues jokingly. My love for programming comes mostly from my urge to automate things. Here is a very useful script for automating a monkey job most data analysts hate doing: reporting.

Who doesn't have lots of monkey jobs (mundane manual work) to do every day? I wish I could have my teeth brushed automatically every day... That is not possible, but it is possible to automate reporting if the reports are canned. I now have many emails with canned reports sitting in my inbox every morning before I wake up. This has improved my productivity by leaps and bounds without any expensive or crazily complex process. I am quite satisfied with the result, given the hacky nature of it.


Attached below is the script (save it as reporter.vbs). It is very easy to schedule in the Windows Task Scheduler so the machine runs the query, formats the output, and sends the report to folks automatically. To schedule it (snapshot omitted), set the Action to 'Start a program' and give 'wscript' as the program name. The arguments would be something like this:
reporter.vbs filename sqlname subject liuyunliu@gmail.com,liuyunliu@yahoo.com

The following script is almost self-explanatory. It takes a list of arguments:
1. filename: file in which to save the output of the SQL query
2. sqlname: file where the SQL query is saved
3. subject: subject line for the email
4. emaillist: a list of email addresses for the To: section of the email

Nothing fancy here, but it does the job... One thing to keep in mind: since my data is on Greenplum, the SQL connection and execution go through psql.exe. Make sure you get psql.exe set up with a saved password. Maybe I will share how to do that in a different post.
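For reference until then: psql (and other libpq-based clients) typically picks up a saved password from a pgpass.conf file, which on Windows lives under %APPDATA%\postgresql\. Each line follows hostname:port:database:username:password; using the connection details from the script below, an entry might look like this (the password is of course a placeholder):

# %APPDATA%\postgresql\pgpass.conf -- one connection per line
xxx.xxxx.xxxx:5432:p1gp1:username:yourpassword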

Leave a comment if you have questions or need help setting this up.

'Read me:
'how to use the script:
'wscript reporter.vbs filename sqlname subject emaillist:1;2
'remember to set up the SQL utility. I use psql.exe for Greenplum....

filename=Wscript.arguments(0)
sqlname=wscript.arguments(1)
subject=wscript.arguments(2)
emaillist=wscript.arguments(3)

Set WshShell = WScript.CreateObject("WScript.Shell")

intReturn =WshShell.Run("psql.exe -d p1gp1 -h xxx.xxxx.xxxx -p 5432 -U username -H -f " & sqlname & " -o " & filename , 0,TRUE)

'These constants are defined to make the code more readable
Const ForReading = 1, ForWriting = 2, ForAppending = 8
Dim fso, f
Set fso = CreateObject("Scripting.FileSystemObject")
'Open the file for reading
Set f = fso.OpenTextFile(filename, ForReading)
'The ReadAll method reads the entire file into the variable BodyText
BodyText = f.ReadAll
'Close the file
f.Close
Set f = Nothing
Set fso = Nothing


Set olApp = CreateObject("Outlook.Application")
Set olMsg = olApp.CreateItem(0)

With olMsg
  .To = emaillist
  .Subject = subject
       '.BodyFormat = olFormatHTML
       '.HTMLBody = "<HTML><BODY>Enter the message text here. </BODY></HTML>"
  .HTMLBody =BodyText
  .Attachments.Add filename 

  '.Display
  .send
End With

''''This script is partly based on prior work of following.
''''http://www.paulsadowski.com/wsh/cdo.htm#LoadFromFile

Saturday, April 27, 2013

should one start a 'web' company?

So I was thinking: what kind of companies are people starting now? They must be 'web'-based companies, right?

In Silicon Valley, startups are like mushrooms popping up after a spring rain. Truth be told, I am thinking of starting something... Who isn't? Even though one knows that maybe fewer than one in a thousand startups eventually becomes successful, that cruel fact doesn't stop everyone from trying (or, in my case, thinking about it).
Naturally, I wanted to see some data and get an idea of what's happening. Crunchbase is the ideal place to gather some data and have some fun.

After scraping Crunchbase for company names and basic information like start date, I got a very rough bar chart showing what kinds of companies are out there. The chart is in no way pretty or complete with all its labels, but it is sufficient to tell my story. Forgive me for not spending the time to make it beautiful (as I should).

(count of companies by category)

Crunchbase returned 118k or so companies. About 40% of those were not properly categorized (cat=='other' or null). But roughly speaking, there are a lot more web/software companies than any other category. Oddly, there is a very small number of 'search' companies. Is that because Google/Bing are already the dominant players, so there is no more room for opportunity in search?
I guess 'web' and 'software' companies are the most popular. Keep in mind that I don't know how Crunchbase categorizes companies; that's beside the point since we get the idea already. One thing that has me scratching my head is that 'big data' wasn't even a category... weird!

Another question I have is when these companies were started. To get that, we have to loop over API calls to fetch the details for every company, and I have no intention of making 118k calls. So I just randomly grabbed the start date for about 2% of the web companies. A quick and dirty bar chart is enclosed here. Interestingly, most of them were started in 2010. Maybe Crunchbase hasn't got wind of all the companies started in 2011 and 2012? But one can see a clear rising trend since 2005.

(web companies by starting year)

If you are interested, check out my sample R code leveraging the Crunchbase API:


library(RCurl)
library(rjson)
url <-getURL('http://api.crunchbase.com/v/1/company/facebook.js?api_key=mwfrwmfswv9wk8z6tcb3rbxd')
document <- fromJSON(url, method='C')

facebook=unlist(document)

x=document[1:12]             # keep the first 12 fields of the company record
x1=rbind(sapply(x,unlist))   # flattened fields as a single row
x2=sapply(x,unlist)
x3=sapply(x,unlist)
xx=rbind(x2,x3)              # stack two flattened copies, just to see the shape


#
#getting a list of companies
#
company=file("c:\\users\\yun.liu\\downloads\\companies.json.js")
comp=fromJSON(file=company, method='C')


y=do.call('rbind',comp)

name=sapply(y[,1],unlist)      # company name
permlink=sapply(y[,2],unlist)  # permalink used by the API
cate=sapply(y[,3],unlist)      # category code
web=sapply(y[y[,3]=='web',2],unlist)   # permalinks of the 'web' companies

yy=data.frame(cbind(name,permlink,cate))

xx=do.call('rbind', sapply(cate,unlist))
tt=table(xx)
barplot(sort(tt),las=2)
tt['other']=tt['other']+length(cate)-sum(tt)
barplot(sort(tt),las=2)



####   library
if(require(RCurl)==F) {
update.packages(repos=c("http://cran.cnr.Berkeley.edu", "http://www.stats.ox.ac.uk/pub/RWin"), ask=F)
install.packages("RCurl", repos=c("http://cran.cnr.Berkeley.edu", "http://www.stats.ox.ac.uk/pub/RWin"), dependencies=T)
require(RCurl)
}
if(require(rjson)==F) {
  update.packages(repos=c("http://cran.cnr.Berkeley.edu", "http://www.stats.ox.ac.uk/pub/RWin"), ask=F)
  install.packages("rjson", repos=c("http://cran.cnr.Berkeley.edu", "http://www.stats.ox.ac.uk/pub/RWin"), dependencies=T)
  require(rjson)
}


#### for each company
#### getting the details  based on permalink

ScrapeTechcrunchie<-function(permalink)
 {
  url <-getURL(paste('http://api.crunchbase.com/v/1/company/',permalink, '.js?api_key=mwfrwmfswv9wk8z6tcb3rbxd',sep=""),followlocation=TRUE)
   #if(nchar(url)<200 & grep('redirected',url)>0 ) 
    #{
    #url <-getURL(paste('http://ec2-107-21-104-179.compute-1.amazonaws.com/v/1/company/',permalink, '.js',sep=""))
    #} 
  document <- fromJSON(url, method='C',unexpected.escape = "skip")
  #myt<-Sys.time()
  document[1:12]
  
}
RunScrape<-function(mylist)
{
  #mylist=permlink[1:12]

  out<-NULL
  for(i in 1:length(mylist))
  {
    try(
       {
         tmp<-ScrapeTechcrunchie(mylist[i])
         out<-rbind(out, tmp)
         print(paste("i=", i, " company=", mylist[i],  sep=""))
         Sys.sleep(0.5)
       }, silent=T)
  }
  
  print(paste("Scraping done! "))
  out
}


sampl=(runif(length(web),0,1)<0.02)   # randomly flag about 2% of the web companies
web1=web[sampl]                       # sampled permalinks
ttt=RunScrape(web1)
table(unlist(ttt[,10]))               # the 10th field holds the founding year used for the chart
barplot(table(unlist(ttt[,10])))


Friday, April 26, 2013

mapping people flying between cities

A couple of weeks ago, I became interested in how the Facebook friendship map was created using R. Here is my learning experience. After reading a few blogs about the Facebook chart and about mapping in general, I was able to create a very similar chart. It is super interesting.

Here is my rendering of a sample data set with the number of flights between cities for one day. My sample data is concentrated on flights within the USA. Blue curves indicate more flights than grey curves. One can visually make out the boundaries of the continents. It is not nearly as nice as the original, but I am happy I got this far! A few more tweaks might make it much better.



Following is the rough process to get there.
1. Getting familiar with the orientation of the R map is very important. Longitude [-180, 180] and latitude [-90, 90] are basically the x and y ranges of the space. Here are a few points, lines, and curves to illustrate the layout of the map.

library(maps)
library(geosphere)
# mapping test: draw the world map, then a few reference points, lines, and arcs

map("world", col="#f2f2f2", fill=TRUE, bg="white", lwd=0.01) #, xlim=xlim, ylim=ylim)

segments(0,0,100,100, col='red', lwd=2)       # straight line in lon/lat space
lines(c(0,-100 ),c(0,-100), col='blue',lwd=2)
points(180,0,col='red',lwd=2)                 # right edge of the map (date line)
points(-180,0,col='red',lwd=2)                # left edge of the map (date line)
points(1:10*-10, 1:10*10, col='red',lwd=2)    # a diagonal run of points
# great-circle arcs; inter crosses the date line, so it comes back as a list of two segments
inter <- gcIntermediate(c(-100,40), c(140,-37.331979), n=50, addStartEnd=TRUE, breakAtDateLine=T)
inter1 <- gcIntermediate(c(100,40), c(140,-37.331979), n=50, addStartEnd=TRUE, breakAtDateLine=T)
lines(inter1,col='red')                       # arc from China to Australia


2. The curvature is another key concept that needs careful handling. I drew a curve from China to Australia on the chart above. The gcIntermediate() function is great for drawing an arc between two points on the map. However, a few things deserve mentioning. First of all, breakAtDateLine=T is needed so that we don't end up with a lot of horizontal lines crossing the map where arcs hit the date line. Secondly, the arc is sometimes quite high, which is not visually pretty; I think a bit of work could be done to make the arc lower. I will save that work for the next blog.

3. Finally, in order to create a nice chart with arcs between many pairs of points, one needs a few tricks to manage the rendering of color and layout. The background color and arc colors can be changed easily. Example code is included here.


trip=read.csv("od_cnt.csv",header=T)
trip=read.csv('odpairs.csv',header=T)
fsub <- trip
maxcnt <- max(fsub$count_itiniery)
fsub <- fsub[order(fsub$count_itiniery),]

pal <- colorRampPalette(c("#f2f2f2", "red"))        # first palette (overridden by the next line)
pal <- colorRampPalette(c('grey',"blue", "white"))  # grey-to-blue-to-white ramp used for the arcs
colors <- pal(100)


pdf("test4.pdf",height=100,width=120)
map("world", col="black", fill=F, bg="black", lwd=1) #,xlim=xlim, ylim=ylim)



for (i in 1:length(fsub$count_itiniery)) {
#for (i in 1:100) {
  inter <- gcIntermediate(c(fsub[i,]$orig_long, fsub[i,]$orig_lat), c(fsub[i,]$dest_long, fsub[i,]$dest_lat), n=100, addStartEnd=TRUE,breakAtDateLine=T)
  #inter <-greatCircle(c(fsub[i,]$orig_long, fsub[i,]$orig_lat), c(fsub[i,]$dest_long, fsub[i,]$dest_lat), n=100, sp=F)
  #inter <- clean.Inter(c(fsub[i,]$orig_long, fsub[i,]$orig_lat), c(fsub[i,]$dest_long, fsub[i,]$dest_lat),n=100, addStartEnd=TRUE)
  colindex <- max(1, round( (fsub[i,]$count_itiniery / maxcnt) * length(colors) ))  # avoid a zero color index for tiny counts
  lenwd<- (fsub[i,]$count_itiniery / maxcnt)   # line width scales with flight count
  if (length(inter)>2) {
  # a single matrix: the arc does not cross the date line
  lines(inter, col=colors[colindex], lwd=lenwd)
  }
  else {
    # a list of two pieces: the arc crosses the date line, so draw each half
    lines(inter[[1]],col=colors[colindex],lwd=lenwd)
    lines(inter[[2]],col=colors[colindex],lwd=lenwd)
  }
}
dev.off()

Monday, April 1, 2013

big deal about big data

I have been going to meetups this year: two or three meetups every week. Check it out on www.meetup.com. So meetup has been, and will be, a big deal for the coming year.

Mostly I join the 'big data' and 'start up' meetups. There are so many of them. All of a sudden, everyone is either starting a meetup or going to some meetup. It is a great experience to be able to see all the latest and greatest trends in technology. Aside from the learning experience, there is also the great benefit of meeting great folks. So I think big data has been, and will be, a big deal for the coming year.

The big data meetup organized monthly by 'the HIVE' is arguably the best, for the following reasons:
1. It is organized nicely: great venue, great food, smooth flow.
2. The best presenters are invited: the quality is the highest among all the meetups I have been to.
3. Great people show up. Truth be told, a lot of folks keep coming back for the wine and food, but I think in general high-caliber folks are attracted.

Personally, I think big data is hype. There is really not that big a deal about it. However, a lot of companies are making huge amounts of money off it: Google, Amazon, LinkedIn, etc. So what do I know? To get a good sense of the current issues debated by the community, check out this link.

http://www.sipa.org/cms/index.php?option=com_content&view=article&id=190:big-data-is-big-business-february-2013-event-of-sipa&catid=35:event&Itemid=54