


JESPortion

Primer: The JES Part


Objective:


Getting Started


Let's look at the code block by block.



  
def ExcelLab():
  disneyPricesWebpage = "http://finance.yahoo.com/q/hp?s=DIS"
  outputFileLocation = "/Users/lawrenceolson/desktop/DisneyPrices.txt"
  info = grabWebpage(disneyPricesWebpage)
  breakdownYahooPrices(info, outputFileLocation)


This is the first block of code which is really an entire procedure. There are five lines.
  1. This line is a standard def that we are all familiar with by now.
  2. This line gives the location of the webpage where we will find our data. In this case we will be looking at the historical prices for the Walt Disney company, DIS.
  3. This line gives the location of where we want to write our output file. Notice that despite the fact that we will be using this generated file in conjunction with Excel, it is still a plain old text file.
  4. This calls the function grabWebpage(). grabWebpage() is passed the location of Disney's historical prices on Yahoo. After the function is done, the data on the webpage (the HTML) is stored in the variable "info". This keeps the data around so that we can use it later.
  5. This is the final line. It is a call to the procedure breakdownYahooPrices(), which is passed the HTML stored in "info" and the output file location.



import urllib

def grabWebpage(webpageLocation):
  webpageConnection = urllib.urlopen(webpageLocation)
  webpageInfo = webpageConnection.read()
  webpageConnection.close()
  return webpageInfo


This is a general function that will grab a webpage from a given location and then return its information (the HTML). You may have noticed that I included the import line with this block of code. I did this because this is the function that actually uses methods from urllib. Technically speaking, the two procedures, ExcelLab() and breakdownYahooPrices(), could also use methods from urllib. Once again, we have five lines of code to look at.
  1. The standard def.
  2. This opens a connection to the webpage given by webpageLocation. In our case, that is Yahoo's historical prices page for Disney.
  3. This reads the information from the webpage and stores it in webpageInfo.
  4. This closes the connection to the Yahoo page.
  5. This returns webpageInfo.
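
If you want to try the same idea outside of JES, here is a minimal sketch. It assumes a newer Python 3 interpreter, where urlopen lives in urllib.request instead of urllib; the fallback import is for Python 2 and JES. The function name and steps come from the block above. The try/finally is an addition that makes sure the connection gets closed even if the read fails.

```python
# Sketch only: assumes Python 3, where urlopen moved to urllib.request.
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib import urlopen           # Python 2 / JES, as in the lab code

def grabWebpage(webpageLocation):
    # Open a connection to the page.
    webpageConnection = urlopen(webpageLocation)
    try:
        # Read everything the page sends back (the HTML).
        webpageInfo = webpageConnection.read()
    finally:
        # Close the connection whether or not the read worked.
        webpageConnection.close()
    return webpageInfo
```

One thing to watch for: in Python 3, read() gives back bytes rather than a string, so you would need to decode the result before searching it with string arguments to find().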



def breakdownYahooPrices(htmlInfo, outputFileLocation):
  outputFile = open(outputFileLocation, "wt")
  startPhrase = 'width="14%">Date'
  stopPhrase = '</td></tr><tr><td class="yfnc_tabledata1" colspan="7" align="center"> * <small>'
  index = htmlInfo.find(startPhrase)
  stoppingPoint = htmlInfo.find(stopPhrase,index)


This is the beginning of our main function, which will actually get the particular data we desire and format it so that Excel can use it. The six lines can be broken down as below.
  1. The standard def. This procedure takes in the information we extracted from the webpage using grabWebpage() as htmlInfo. It also takes in the location where we want to store our final file. This location is passed in as outputFileLocation.
  2. This line opens a file at the location we gave it and prepares it for "wt"; that is, Writing Text. It then gives us the "opened file". We are storing that "opened file" as outputFile.
  3. This is the HTML we will use to determine our starting point. This is page dependent, and if Yahoo ever changes the format of their page, this would need to be updated.
  4. This is the HTML we will use to determine our stopping point. Once again, this is page dependent.
  5. This finds our starting point and then stores its numerical location in the variable index.
  6. This finds our stopping point, searching from index onward, and then stores its numerical location in the variable stoppingPoint.
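
To see the start/stop idea on something small, here is a sketch that uses a made-up string instead of the real Yahoo page. The page text and both marker phrases are assumptions chosen for illustration only; the two find() calls are the same pattern as lines 5 and 6 above.

```python
# Made-up page text, for illustration only (not Yahoo's real HTML).
htmlInfo = ("<p>intro</p><table><tr><td>Date</td></tr>"
            "<tr><td>1-Jan</td></tr></table><p>end</p>")
startPhrase = "<td>Date"    # hypothetical start marker
stopPhrase = "</table>"     # hypothetical stop marker

# Where does the data begin?
index = htmlInfo.find(startPhrase)
# Where does it end? Only search from index onward.
stoppingPoint = htmlInfo.find(stopPhrase, index)

# Everything we care about sits between the two locations.
print(htmlInfo[index:stoppingPoint])
```

If either phrase were missing from the page, find() would return -1, so it may be worth checking for that before going on.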



while(index <= stoppingPoint):


This is a loop type that we have not really looked at much. Let's go over it.

What are all of the rules governing while? Well, you must have some condition that can be checked for truth. This is not to say that you cannot have more than one. You could have any number of conditions. For instance, while((something is true) and (something else is true)). This would check two conditions. You can also use "or" in place of "and". One last thing is tricky about whiles. You will almost always need an update step that changes some part of your condition. If you do not have one, your while loop may run forever.
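
Here is a small sketch of a while with two conditions and an update step. The variable names and the limits 10 and 12 are made up for illustration.

```python
count = 0
total = 0
# Keep going only while BOTH conditions are true.
while (count < 10) and (total < 12):
    count = count + 1       # update step: changes part of the condition
    total = total + count   # this changes the other part
print(count)
print(total)
```

The loop stops as soon as either condition becomes false. Here total passes 12 first, so the loop ends with count at 5 and total at 15. If you deleted both update lines, neither condition would ever change and the loop would run forever.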

You may be wondering how in the world we came up with the criterion of index <= stoppingPoint. Well, there is no real "right" answer to this. More or less, we studied what the HTML looked like and based our criterion on that.

There are, however, some more obvious requirements that can be put in a while loop (although we did not here). The first is to check your location versus the size of the page. That condition would be written "index < len(page)". We want information from whatever webpage we are looking at. Therefore, if index is bigger than the page's length, we are not on that page anymore and there is nothing left to look at. Something else to check for is that some text can still be found. For instance, if we wanted to search in a table, a potential condition for that would be "page.find("</table>",index) != -1". This means that if we can still find the /table tag when looking from our current location, then keep going.
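
Here is a sketch that puts both kinds of checks into one loop. The page string is made up, and the loop looks for "<td>" instead of "</table>" because it wants to stop when there are no more cells to read; both choices are assumptions for this example.

```python
# Made-up page text, for illustration only.
page = "<table><td>10</td><td>20</td><td>30</td></table> trailing text"
index = 0
cells = []
# Keep going while we are still inside the page AND another cell can be found.
while (index < len(page)) and (page.find("<td>", index) != -1):
    cellStart = page.find("<td>", index) + len("<td>")
    cellEnd = page.find("</td>", cellStart)
    cells.append(page[cellStart:cellEnd])
    index = cellEnd + 1     # update step: move past the cell we just read
print(cells)
```

When the last cell has been read, find() returns -1, the second condition becomes false, and the loop ends with cells holding "10", "20", and "30".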

Before we continue, we need to know what in the world find() and rfind() are. Well, they work in the same way, but they have some properties that may be kind of tricky until you get used to them.
find() searches a string for a piece of text and gives back the location where that text first appears, or -1 if it cannot be found.
rfind() is the same as find() except that it will start from the end and work towards the beginning.
Lets say we have a string, "ABACABB" and we store that in "page2".
Here are some examples with one argument of what might happen.
>>> page2 = "ABACABB"
>>> page2.find("a")
-1
>>> page2.find("A")
0
>>> page2.find("B")
1
>>> page2.find("C")
3
>>> page2.rfind("B")
6
>>> page2.rfind("A")
4


Now with two arguments...
>>> B1 = page2.find("B")
>>> print B1
1
>>> B2 = page2.find("B",B1+1)
>>> print B2
5

Whoa, why in the world do I have B1 + 1? Well, find() does inclusive/exclusive searches. Remember things like [1,5] vs. [1,5) in math? find() follows the form [start,end). If we give the beginning of the search for B2 (the second B) as B1, then it will find the location of B1 over and over again. This is because the first place find() will search is B1, inclusive. It finds a B at location B1 and says "oop, I'm done". So we need find() to go beyond B1, and thus we start one place beyond it.
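
The same "start one place beyond" trick can be put in a loop to collect every B. This is just a sketch; the names spot and locations are made up.

```python
page2 = "ABACABB"
locations = []
spot = page2.find("B")
while spot != -1:
    locations.append(spot)
    # Start one place beyond the B we just found, or we would find it again.
    spot = page2.find("B", spot + 1)
print(locations)
```

This collects 1, 5, and 6. If the update line used spot instead of spot + 1, find() would keep returning the same location and the loop would never end.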


Here are some examples of the inclusive/exclusive nature of find.
>>> page= "ABCD"
>>> page.find("A")
0
>>> page.find("A",0)
0
>>> page.find("B",0,1)
-1
>>> page.find("B",0,2)
1


Now onto rfind ...


Remember, here we are going back to the string "ABACABB"
Thus page2 = "ABACABB"
>>> LastB = page2.rfind("B")
>>> print LastB
6
>>> nextToLastB = page2.rfind("B",LastB)
>>> print nextToLastB
6
>>> nextToLastB = page2.rfind("B",LastB-1)
>>> print nextToLastB
6
>>> nextToLastB = page2.rfind("B",LastB-2)
>>> print nextToLastB
6
>>> nextToLastB = page2.rfind("B",LastB-3)
>>> print nextToLastB
6

It seems we have another problem. We found the last "B" easily. However, what we wanted was to find the B right before it. So, we tried another rfind(), this time starting at LastB. What we got was the same location. So we thought that maybe we were having an inclusion/exclusion problem again. Knowing that we were at the end of the string and working our way backwards, we decided not to include the end of the string and subtracted 1. It still did not work. So we moved on and tried -2 and -3. Still no luck. Why not?

Well, when find() or rfind() gets a second argument, it interprets that as the starting point of the part of the string it is going to search. In the case of find() this is OK. However, in rfind() it causes trouble. When we see the word "start", we expect that the computer will work relative to that starting location: forward from start for find(), and backwards from start for rfind(). This is not really how rfind() works, though. When we do not specify starting or stopping places for find() and rfind(), they assume them: the beginning and the end of the string we are working with. rfind() always works from the end towards the beginning. So the issue here is that we were telling it a starting place, it was assuming the ending place (the end of the string), and then it was working backwards from that end. To correct this, we can specify some other ending place...
>>> nextToLastB = page2.rfind("B",0,LastB)
>>> print nextToLastB
5

0 is the starting point for any string, so we tell it that 0 is the beginning and to use LastB as the end. Because the end is exclusive, the B at LastB itself is left out of the search.
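
Here is a sketch of the same fix in a loop, walking backwards through every B. The names spot and locations are made up for the example.

```python
page2 = "ABACABB"
locations = []
spot = page2.rfind("B")
while spot != -1:
    locations.append(spot)
    # Search from 0 up to (but not including) the B we just found.
    spot = page2.rfind("B", 0, spot)
print(locations)
```

This collects 6, 5, and 1, in that order. Notice that no "- 1" is needed here: the end of the search is exclusive, so the B at spot is already left out.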




One last quick blurb of code and then we are done with this part.
 
    dataRow = htmlInfo.find("<tr>",index)
    start = htmlInfo.find('align="right',dataRow)
    dateStart = htmlInfo.find(">",start)+1
    dateEnd = htmlInfo.find("</td>",dateStart)
    outputFile.write(htmlInfo[dateStart:dateEnd]+"\t")
    .
    .
    .
    outputFile.write(htmlInfo[adjCloseStart:adjCloseEnd]+"\n")
    index = adjCloseEnd + 1
  outputFile.close()


Here are the lines of code that will find the data and write it to an outputFile. First, note the ". . ."s. There is a good deal of code left out because its explanation would be redundant.
  1. This line finds a row in the HTML. It searches the area after index. (Remember, index is where we have determined that the data we want starts.)
  2. This line finds the location right before the data.
  3. This line gives the actual starting location of the data.
  4. This line finds the end of the data.
  5. This line writes out to a file the data between dateStart and dateEnd. After it has written that out, it adds a tab right next to it.
  6. This line is similar to the one just explained, except that it writes out a different segment of data and adds a newline right after it. If it did not, our file would be one long string and breaking it down with Excel would be difficult.
  7. This is the update step for this loop. It ensures that we are always moving forward towards our stopping point.
  8. Finally, this line closes the file we have been writing to.
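
To watch the whole find-then-write pattern run on something small, here is a self-contained sketch. Everything in it is an assumption for illustration: the two made-up rows stand in for Yahoo's page, each row has only a date and one price, and io.StringIO (in Python 3) stands in for the opened text file so nothing gets written to disk. It does not fill in the code left out by the ". . ."s above.

```python
import io

# Made-up rows, for illustration only; the real page has more columns.
htmlInfo = ('<tr><td align="right">1-Jan</td><td align="right">25.10</td></tr>'
            '<tr><td align="right">2-Jan</td><td align="right">25.40</td></tr>')
outputFile = io.StringIO()   # stands in for open(outputFileLocation, "wt")

index = 0
while htmlInfo.find("<tr>", index) != -1:
    # Find a row of data.
    dataRow = htmlInfo.find("<tr>", index)
    # First cell: the date, followed by a tab.
    start = htmlInfo.find('align="right', dataRow)
    dateStart = htmlInfo.find(">", start) + 1
    dateEnd = htmlInfo.find("</td>", dateStart)
    outputFile.write(htmlInfo[dateStart:dateEnd] + "\t")
    # Second cell: the price, followed by a newline.
    start = htmlInfo.find('align="right', dateEnd)
    priceStart = htmlInfo.find(">", start) + 1
    priceEnd = htmlInfo.find("</td>", priceStart)
    outputFile.write(htmlInfo[priceStart:priceEnd] + "\n")
    # Update step: move past the data we just wrote.
    index = priceEnd + 1

result = outputFile.getvalue()
outputFile.close()
print(result)
```

Each row of the page becomes one line of text, with a tab between the date and the price. That is the layout Excel can split into columns.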

Now, if you've gathered some or all of how this works internally, run ExcelLab(). It will generate a file with whatever name you specified. In my case, I chose DisneyPrices.txt. Review the bonus below and see if you want to attempt it. After you have done that, go on to the ExcelPortion.

BONUS POINTS +5(x2)

while(index <= stoppingPoint):

    # Find a Row of Data
    dataRow = htmlInfo.find("<tr>",index)
    
    while( <something is true>):
      <do Something>
      <Update Something>

    #Find Adj Close
    start = htmlInfo.find('align="right',volumeEnd)
    adjCloseStart = htmlInfo.find(">",start)+1
    adjCloseEnd = htmlInfo.find("</td>",adjCloseStart)
    #Write the adjusted closing price  and add a newline to the end
    outputFile.write(htmlInfo[adjCloseStart:adjCloseEnd]+"\n")

    index = adjCloseEnd + 1
  outputFile.close()



Turnin

You will use JES to turn in this portion of the lab. Save your modified code in JES as ExcelCode.py. Keep it on your hard drive until you turn in the Excel portion of the lab. More instructions will be under the turnin section of the Excel lab.


Ask questions about the Excel Lab.
