Cloud computing refers to services hosted over the Internet. For example, Google Drive and Google Docs are cloud services. Google Drive allows you to save your files on Google's hardware. Google Docs is a collection of software (word processor, spreadsheet, etc.) hosted on Google's servers. The software resides on Google's hardware and uses their memory and CPU; you access it through the Internet. There are three categories of cloud services:
Unit testing functionality is built into Python 3 through the unittest module. Following is a very simple example:
def arithmetic(a):
    return a * a
Save this file as mycode.py. Save the following code as mycode_unittest.py
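The article does not reproduce the contents of mycode_unittest.py, so here is a minimal sketch of what such a test file could look like. The function is copied in so the example is self-contained; in practice you would import it with `from mycode import arithmetic`:

```python
import unittest

# Normally imported from mycode.py; copied here so the sketch runs on its own.
def arithmetic(a):
    return a * a

class TestArithmetic(unittest.TestCase):
    def test_square_of_positive(self):
        self.assertEqual(arithmetic(5), 25)

    def test_square_of_negative(self):
        # Squaring a negative number gives a positive result
        self.assertEqual(arithmetic(-3), 9)

if __name__ == "__main__":
    # exit=False keeps the interpreter running after the tests finish
    unittest.main(exit=False)
```

Running the file reports how many tests ran and whether they passed.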
See the following program. Save as datatype.py:
a = 100        # integer
b = 1.23       # float
c = "python"   # string

# print variables
print(a)
print(b)
print(c)

# convert int to float
print(float(100))

# convert float to int
print(int(3.14))

# convert string to int
d = "12"
e = "12.3"
print(int(d))
# print(int(e)) - this will generate an error

# convert string to float
print(float(d))
print(float(e))

# convert int to string
f = str(12)
print(type(f))
This page shows how to work with text files using Python 3 code. Following is the sample text file we will be using; let's call it stocks.txt:
bce.to hnd.to mtl.to
Reading from File
You can use the read(), readline(), or readlines() functions to read from a file. This example uses read():
stocks = open("data/stocks.txt", "r")
print(stocks.read())
stocks.close()
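The same file can also be read line by line with readline() or all at once into a list with readlines(). A small sketch; the file is created first so the example is self-contained, but in the article data/stocks.txt already exists:

```python
import os

# Create the sample file (in the article this file already exists)
os.makedirs("data", exist_ok=True)
with open("data/stocks.txt", "w") as f:
    f.write("bce.to\nhnd.to\nmtl.to\n")

# readlines() returns a list with one entry per line (newlines included)
with open("data/stocks.txt", "r") as f:
    lines = f.readlines()
print(lines)          # ['bce.to\n', 'hnd.to\n', 'mtl.to\n']

# readline() returns one line per call
with open("data/stocks.txt", "r") as f:
    first = f.readline()
print(first.strip())  # bce.to
```

The `with` statement closes the file automatically, so no explicit close() call is needed.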
Get Current Date
import datetime as dt

now = dt.datetime.now()
print(now.year)
print(now.month)
print(now.day)
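A datetime object can also be formatted as a single string with strftime(). A short sketch; the fixed date in the second call is only there to make the output predictable:

```python
import datetime as dt

# %Y = 4-digit year, %m = zero-padded month, %d = zero-padded day
now = dt.datetime.now()
print(now.strftime("%Y-%m-%d"))

# A fixed date gives a predictable result
stamp = dt.datetime(2024, 1, 5)
print(stamp.strftime("%Y-%m-%d"))  # 2024-01-05
```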
Download File from Internet
import urllib.request

url = 'http://molecularsciences.org'
response = urllib.request.urlopen(url)
mydata = response.read()
mytext = mydata.decode('utf-8')
print(mytext)
Hive provides an SQL-like language (HiveQL) for processing and analyzing data stored in Hadoop. It does not require knowledge of any programming language. Hive is not suitable for OLTP; it is designed for analyzing big data.
Hadoop is not a database; it is an ecosystem of tools that provides the features we require and desire when dealing with big data. Hadoop stores data in HDFS, and its native processing model is MapReduce. Hive converts your SQL commands to MapReduce jobs. Hive also supports workflow integration with other tools such as Excel or Cognos.
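The MapReduce model that Hive compiles SQL into can be illustrated in plain Python. This word-count sketch is only an illustration of the map/shuffle/reduce idea, not actual Hadoop code; the input lines are made up:

```python
from collections import defaultdict
from functools import reduce

lines = ["big data tools", "big data analytics"]

# Map step: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle step: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce step: sum the counts for each word
counts = {word: reduce(lambda a, b: a + b, vals)
          for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'tools': 1, 'analytics': 1}
```

In a real cluster, the map and reduce steps run in parallel across many machines, and the shuffle moves data between them.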
Spark is a distributed computing platform that builds on Hadoop MapReduce. It extends the MapReduce model, making it easier to code for and faster to execute.
Spark provides APIs in Java, Scala, and Python; any of these languages can be used to create Spark applications.
Spark supports map and reduce operations, SQL queries, streaming data, machine learning algorithms, and graph algorithms.
Spark stack contains:
The join clause allows us to combine rows from two SQL tables. Tables can be combined in different ways: think of an inner join as an intersection of two tables and an outer join as a union of two tables. This table explains the different types of joins.
We will be using the following two tables in the examples:
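As a runnable illustration of the intersection-versus-union idea, here is a sketch using Python's built-in sqlite3 module. The customers and orders tables below are hypothetical, made up just for this example:

```python
import sqlite3

# Hypothetical tables for illustration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
            "customer_id INTEGER, item TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Ann"), (2, "Bob"), (3, "Cam")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, "book"), (11, 1, "pen"), (12, 2, "lamp")])

# INNER JOIN: only customers that have at least one order (intersection)
inner_rows = cur.execute(
    "SELECT c.name, o.item FROM customers c "
    "INNER JOIN orders o ON c.id = o.customer_id "
    "ORDER BY c.name, o.item").fetchall()
print(inner_rows)  # [('Ann', 'book'), ('Ann', 'pen'), ('Bob', 'lamp')]

# LEFT OUTER JOIN: every customer, with NULL where no order exists
left_rows = cur.execute(
    "SELECT c.name, o.item FROM customers c "
    "LEFT JOIN orders o ON c.id = o.customer_id "
    "ORDER BY c.name, o.item").fetchall()
print(left_rows)   # [('Ann', 'book'), ('Ann', 'pen'), ('Bob', 'lamp'), ('Cam', None)]

conn.close()
```

Note that Cam, who has no orders, disappears from the inner join but survives the outer join with a NULL item.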
The goal of database normalization is to reduce data redundancy and to improve data integrity. This is done by organizing columns and tables according to the normal forms. In theoretical database terminology, a table is called a relation and a column is called an attribute.
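A small sketch of what normalization buys you, using made-up data in plain Python. In the denormalized table, the customer's city is repeated on every order row; after normalizing, each fact is stored exactly once:

```python
# Hypothetical denormalized table: the city is repeated on every order row,
# so changing a customer's city means updating many rows (redundancy).
orders_denormalized = [
    {"order_id": 10, "customer": "Ann", "city": "Toronto", "item": "book"},
    {"order_id": 11, "customer": "Ann", "city": "Toronto", "item": "pen"},
    {"order_id": 12, "customer": "Bob", "city": "Ottawa",  "item": "lamp"},
]

# Normalized: customers are stored once; orders reference them by key.
customers = {1: {"name": "Ann", "city": "Toronto"},
             2: {"name": "Bob", "city": "Ottawa"}}
orders = [
    {"order_id": 10, "customer_id": 1, "item": "book"},
    {"order_id": 11, "customer_id": 1, "item": "pen"},
    {"order_id": 12, "customer_id": 2, "item": "lamp"},
]

# Updating Ann's city now touches exactly one row...
customers[1]["city"] = "Montreal"

# ...and every order sees the change, with no chance of inconsistency.
print([customers[o["customer_id"]]["city"] for o in orders])
# ['Montreal', 'Montreal', 'Ottawa']
```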
When planning a data science project, you need to spend time seeking clarification about the project. The important questions are:
- What are the goals of the project?
- What resources are required for this project?
- Are there any important deadlines?
- Who are the stakeholders?
Once you have gained sufficient clarity on your project: