import os
import tarfile
from six.moves import urllib
import pandas as pd
import hashlib
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
# Base URL of the handson-ml GitHub repository that hosts the dataset archive.
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
# Local directory into which the archive is downloaded and extracted.
HOUSING_PATH = os.path.join("datasets", "housing")
# Full URL of the housing.tgz archive inside the repository.
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz from *housing_url* and extract it into *housing_path*.

    Side effects only: creates *housing_path* if needed and leaves the
    archive members (housing.csv) extracted under it. Returns None.
    """
    # exist_ok avoids the isdir/makedirs race and an error on re-runs.
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # Context manager guarantees the archive handle is closed even if
    # extractall raises (the original leaked it on error).
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
#getting the housing data
def load_housing_data(housing_path=HOUSING_PATH):
    """Read housing.csv found under *housing_path* into a pandas DataFrame."""
    return pd.read_csv(os.path.join(housing_path, "housing.csv"))
# load_housing_data returns the dataset as a pandas DataFrame.
# Download the archive, then read the CSV into memory.
fetch_housing_data()
housing = load_housing_data()
housing.head()
# NOTE(review): total_bedrooms has fewer non-null entries than the other
# columns (missing values) -- deal with that later.
# ocean_proximity is read as dtype object because that CSV column holds text.
housing.describe()
# describe() summarizes the numeric columns (count, mean, std, quartiles).
# "%matplotlib inline" is an IPython/Jupyter magic and is a syntax error in a
# plain .py file; kept as a comment for anyone pasting this back into a notebook.
# %matplotlib inline
import matplotlib.pyplot as plt

# One histogram per numeric attribute: x axis is the value range, y axis is
# the number of districts whose value falls in each of the 50 bins.
housing.hist(bins=50, figsize=(20, 15))
plt.show()
# Observations from the plots:
# - median_income was preprocessed/scaled (roughly 0.5 to 15, not dollars).
# - median_house_value is capped at $500k; districts at the cap carry
#   clipped labels, so consider dropping them so the model does not learn
#   from those bad target values.
# - Several distributions are tail-heavy (long tails to the right).
import numpy as np
def split_train_test(data, test_ratio, random_state=None):
    """Randomly split *data* into a (train_set, test_set) pair.

    Parameters
    ----------
    data : pandas.DataFrame
        Full dataset; rows are selected by position (iloc).
    test_ratio : float
        Fraction of rows (between 0 and 1) to put in the test set.
    random_state : int or None, optional
        Seed for the shuffle. None (default) preserves the original
        non-deterministic behaviour; pass an int for reproducible splits.

    Returns
    -------
    tuple of pandas.DataFrame
        (train_set, test_set); together they cover every row exactly once.
    """
    # RandomState(None) seeds from OS entropy, matching the old behaviour;
    # a fixed seed makes the same rows land in the test set on every run.
    rng = np.random.RandomState(random_state)
    # A random permutation of all row positions.
    shuffled_indices = rng.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    # First slice of the shuffled order becomes the test set, rest trains.
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
# Re-create the DataFrame (the variable from the earlier cell is out of
# scope in a fresh session).
housing = load_housing_data()
# Bucket median_income into a handful of categories to stratify on:
# divide by 1.5 to limit the number of buckets, then round up.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Merge everything >= 5 into category 5.0. Plain assignment instead of the
# chained .where(..., inplace=True): the chained inplace form is deprecated
# and does not modify the frame under pandas Copy-on-Write.
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
# Stratified (rather than purely random) sampling keeps the test set's
# income-category proportions representative of the whole population.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
This is the loop at the end of the code:
# split.split yields one (train_indices, test_indices) pair per n_splits,
# stratifying on the income_cat labels passed as y.
for train_idx, test_idx in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_idx]
    strat_test_set = housing.loc[test_idx]
Can someone tell me what the last for loop is doing? Basically, it should split the dataset into stratified training and test sets, but I'm especially confused by the loop header: why is the whole DataFrame object passed as the first argument, with the income-category column after it? Does it stratify with respect to each income category that was created, and thereby partition all the remaining columns of the whole DataFrame accordingly?
I'm sure you have already read: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit.split
So split takes two arguments:
housing: the training data, where n_samples is the number of samples and n_features is the number of features.
housing["income_cat"]: the target variable for a supervised learning problem. The stratification is performed according to these y labels.
It yields tuples with 2 entries (each of which is an ndarray):
First entry: the training-set indices for that split.
Second entry: the test-set indices for that split.
This article was collected from the internet; please cite the source when republishing.
In case of copyright infringement, please contact [email protected] for removal.
Leave a comment