pandas tutorial: Interacting with Web APIs and Databases
Contents
- 6.3 Interacting with Web APIs
- 6.4 Interacting with Databases
6.3 Interacting with Web APIs
Many websites have public APIs that provide data feeds via JSON or some other format. There are a number of ways to access these APIs; one easy-to-use method recommended here is the requests package.
To find the latest 30 issues for pandas on GitHub, we can make a GET HTTP request using requests:
import pandas as pd
import numpy as np
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp
<Response [200]>
The json method of the response object returns a dict containing the JSON parsed into native Python objects:
data = resp.json()
data[0]['title']
'Optimize data type'
data[0]
{'assignee': None,'assignees': [],'author_association': 'NONE','body': 'Hi guys, i\'m user of mysql\r\nwe have an "function" PROCEDURE ANALYSE\r\n.5/en/procedure-analyse.html\r\n\r\nit get all "dataframe" and show what\'s the best "dtype", could we do something like it in Pandas?\r\n\r\nthanks!','closed_at': None,'comments': 1,'comments_url': '','created_at': '2017-11-13T22:51:32Z','events_url': '','html_url': '','id': 273606786,'labels': [],'labels_url': '{/name}','locked': False,'milestone': None,'number': 18272,'repository_url': '','state': 'open','title': 'Optimize data type','updated_at': '2017-11-13T22:57:27Z','url': '','user': {'avatar_url': '=4','events_url': '{/privacy}','followers_url': '','following_url': '{/other_user}','gists_url': '{/gist_id}','gravatar_id': '','html_url': '','id': 2468782,'login': 'rspadim','organizations_url': '','received_events_url': '','repos_url': '','site_admin': False,'starred_url': '{/owner}{/repo}','subscriptions_url': '','type': 'User','url': ''}}
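Since each issue dict is deeply nested (the user field is itself a dict, for example), it can be handy to flatten it into dotted column names. A minimal sketch using pandas' json_normalize, with a small hypothetical payload standing in for the live API response:

```python
import pandas as pd

# Hypothetical records mimicking the shape of the GitHub issues payload
data = [
    {"number": 18272, "state": "open",
     "user": {"login": "rspadim", "id": 2468782}},
    {"number": 18271, "state": "open",
     "user": {"login": "someone", "id": 1}},
]

# json_normalize flattens nested dicts into columns like 'user.login'
flat = pd.json_normalize(data)
print(flat.columns.tolist())
```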
Each element in data is a dict containing all of the information found on a GitHub issue page. We can pass data directly to DataFrame and extract the fields of interest:
issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])
issues
| | number | title | labels | state |
|---|---|---|---|---|
0 | 18272 | Optimize data type | [] | open |
1 | 18271 | BUG: Series.rank(pct=True).max() != 1 for a la... | [] | open |
2 | 18270 | (Series|DataFrame) datetimelike ops | [] | open |
3 | 18268 | DOC: update Series.combine/DataFrame.combine d... | [] | open |
4 | 18266 | DOC: updated combine_first doc strings | [{'url': '... | open |
5 | 18265 | Calling DataFrame.stack on an out-of-order col... | [] | open |
6 | 18264 | cleaned up imports | [{'url': '... | open |
7 | 18263 | Tslibs offsets paramd | [] | open |
8 | 18262 | DEPR: let's deprecate | [{'url': '... | open |
9 | 18258 | DEPR: deprecate (Sparse)Series.from_array | [{'url': '... | open |
10 | 18255 | ENH/PERF: Add cache='infer' to to_datetime | [{'url': '... | open |
11 | 18250 | Categorical.replace() unexpectedly returns non... | [{'url': '... | open |
12 | 18246 | pandas.MultiIndex.reorder_levels has no inplac... | [] | open |
13 | 18245 | TST: test tz-aware DatetimeIndex as separate m... | [{'url': '... | open |
14 | 18244 | RLS 0.21.1 | [{'url': '... | open |
15 | 18243 | DEPR: deprecate .ftypes, get_ftype_counts | [{'url': '... | open |
16 | 18242 | CLN: Remove days, seconds and microseconds pro... | [{'url': '... | open |
17 | 18241 | DEPS: drop 2.7 support | [{'url': '... | open |
18 | 18238 | BUG: Fix filter method so that accepts byte an... | [{'url': '... | open |
19 | 18237 | Deprecate Series.asobject, Index.asobject, ren... | [{'url': '... | open |
20 | 18236 | df.plot() very slow compared to explicit matpl... | [{'url': '... | open |
21 | 18235 | Quarter.onOffset looks fishy | [] | open |
22 | 18231 | Reduce copying of input data on Series constru... | [{'url': '... | open |
23 | 18226 | Patch __init__ to prevent passing invalid kwds | [{'url': '... | open |
24 | 18222 | DataFrame.plot() produces incorrect legend lab... | [{'url': '... | open |
25 | 18220 | DataFrame.groupy renames columns when given a ... | [] | open |
26 | 18217 | Deprecate Index.summary | [{'url': '... | open |
27 | 18216 | Pass kwargs from read_parquet() to the underly... | [{'url': '... | open |
28 | 18215 | DOC/DEPR: ensure that @deprecated functions ha... | [{'url': '... | open |
29 | 18213 | Deprecate Series.from_array ? | [{'url': '... | open |
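The labels column above holds raw lists of dicts, which print poorly. One way to make it readable is to pull out just the label names. A minimal sketch, using a small hypothetical sample in place of the live API data:

```python
import pandas as pd

# Hypothetical sample mimicking the GitHub issues payload shape
data = [
    {"number": 18272, "title": "Optimize data type",
     "labels": [], "state": "open"},
    {"number": 18266, "title": "DOC: updated combine_first doc strings",
     "labels": [{"name": "Docs"}], "state": "open"},
]
issues = pd.DataFrame(data, columns=["number", "title", "labels", "state"])

# Flatten each list of label dicts into a comma-separated string of names
issues["label_names"] = issues["labels"].apply(
    lambda labs: ", ".join(d["name"] for d in labs)
)
print(issues[["number", "label_names"]])
```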
6.4 Interacting with Databases
In a business setting, much of the data may not be stored in text or Excel files; SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use. The choice of database usually depends on performance, data integrity, and the scalability needs of an application.
Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, let's create a SQLite database using Python's built-in sqlite3 driver:
import sqlite3
import pandas as pd
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER
);"""
con = sqlite3.connect('../examples/mydata.sqlite')
con.execute(query)
<sqlite3.Cursor at 0x1049931f0>
con.commit()
Then, insert a few rows of data:
data = [('Atlanta', 'Georgia', 1.25, 6),('Tallahassee', 'Florida', 2.6, 3),('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
<sqlite3.Cursor at 0x1049932d0>
con.commit()
Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list of tuples when selecting data from a table:
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
[('Atlanta', 'Georgia', 1.25, 6),('Tallahassee', 'Florida', 2.6, 3),('Sacramento', 'California', 1.7, 5)]
You can pass the list of tuples to the DataFrame constructor, but you also need the column names, which are contained in the cursor's description attribute:
cursor.description
(('a', None, None, None, None, None, None),('b', None, None, None, None, None, None),('c', None, None, None, None, None, None),('d', None, None, None, None, None, None))
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
| | a | b | c | d |
|---|---|---|---|---|
0 | Atlanta | Georgia | 1.25 | 6 |
1 | Tallahassee | Florida | 2.60 | 3 |
2 | Sacramento | California | 1.70 | 5 |
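The two steps above (fetch rows, then read column names from cursor.description) can be wrapped in a small helper so the pattern works for any query. A minimal sketch, demonstrated against an in-memory SQLite database so it is self-contained:

```python
import sqlite3
import pandas as pd

def query_to_frame(con, sql):
    """Run a SQL query and return the result as a DataFrame,
    taking column names from cursor.description."""
    cursor = con.execute(sql)
    rows = cursor.fetchall()
    return pd.DataFrame(rows, columns=[col[0] for col in cursor.description])

# Demo: build a small table in an in-memory database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)",
                [("Atlanta", "Georgia", 1.25, 6),
                 ("Tallahassee", "Florida", 2.6, 3)])
con.commit()

frame = query_to_frame(con, "SELECT * FROM test")
print(frame)
```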
This is quite a bit of munging that you'd rather not repeat each time you query the database. The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of the common differences between SQL databases. pandas has a read_sql function that lets you read data easily from a general SQLAlchemy connection. Here we'll connect to the same SQLite database with SQLAlchemy and read data from the table created before:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///../examples/mydata.sqlite')
pd.read_sql('select * from test', db)
| | a | b | c | d |
|---|---|---|---|---|
0 | Atlanta | Georgia | 1.25 | 6 |
1 | Tallahassee | Florida | 2.60 | 3 |
2 | Sacramento | California | 1.70 | 5 |
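pandas can also go in the other direction: DataFrame.to_sql writes a frame out as a database table, which pairs naturally with read_sql for a round trip. A minimal sketch using an in-memory SQLite connection so it is self-contained (a SQLAlchemy engine would work the same way):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")

# Write a small DataFrame out as a new table named 'test'
frame = pd.DataFrame({"a": ["Atlanta", "Sacramento"], "c": [1.25, 1.7]})
frame.to_sql("test", con, index=False)

# Round-trip: read the rows straight back into a DataFrame
result = pd.read_sql("SELECT * FROM test", con)
print(result)
```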