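30 Days of Python – Day 27 – Machine Learning and Data Science
This post is part of the 30 Days of Python challenge series. Links to all previous posts in the series can be found here.
It is time to dig into some real machine learning and data science coding concepts. Today I focused mainly on getting started with the Jupyter Notebook workflow and on creating a basic project to understand how it works. I then searched for a dataset and drew some useful information out of it, following the basic steps of machine learning. I will also share the notebook I created. The great thing about Jupyter Notebooks is that they can be organised like a blog post or an article, with interactive code, data, and other information alongside the text.
Using Jupyter Notebook
Here are a couple of useful resources for getting to know the Jupyter Notebook interface, its installation, and an overview of its workflow.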
- Jupyter Notebook tutorial video
- Installation guide (installing it through the Anaconda distribution is recommended, since it comes with many useful tools.)
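Since I am a Windows user, here is a quick tip.
On Windows, open Anaconda Prompt from the Start menu, navigate to the directory where you want to create the Jupyter project, and run the command jupyter notebook. This opens the notebook in the browser.
Following the basic steps of machine learning and data science, we will build a project and create a readable notebook that documents the whole process and can then be shared with anyone.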
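Data Science and Machine Learning Basics with a Netflix Shows Project
The basic steps of ML and data science are:
- Import the data from some source
- Clean the data, if needed, to remove anything irrelevant
- Split the data into a training set and a test set
- Create a model, algorithm, or function
- Check the output
- Improve and repeat the steps above
In this basic project we will explore the first two steps.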
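The project itself stops at cleaning, but the third step in the list, splitting the data, is easy to picture with pandas alone. The snippet below is only a minimal sketch: the `toy` frame, its columns, and the 80/20 ratio are assumptions made for illustration and are not part of the Netflix project.

```python
import pandas as pd

# Hypothetical toy data standing in for an already cleaned dataset.
toy = pd.DataFrame({
    "title": [f"show {i}" for i in range(10)],
    "release_year": [2010 + i for i in range(10)],
})

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
test_set = toy.sample(frac=0.2, random_state=42)
# Everything that was not sampled becomes the training set.
train_set = toy.drop(test_set.index)

print(len(train_set), len(test_set))
```

Dedicated libraries offer richer splitting helpers, but sampling and dropping by index is enough to see the idea.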
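Importing and manipulating data
For machine learning and data science, the first and most important thing is the data itself. To reach good, meaningful conclusions we need a good dataset. This input data can be collected in several ways: from databases, by scraping websites, from public APIs, or from publicly shared datasets.
Kaggle is a website popular with machine learning and data science enthusiasts, and it hosts a large number of publicly shared datasets.
I decided to look for a Netflix Shows dataset and found this one on Kaggle – https://www.kaggle.com/shivamb/netflix-shows. It contains the data that will be used for this project in CSV format. Once downloaded, the file can be placed in the root directory of the project. I named it netflix_titles.csv.
Since this data is tabular, meaning it is arranged in rows and columns, pandas, a great open-source library, can be used to process and analyse it. It ships with the Anaconda distribution, so it can be used directly in the notebook.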
import pandas as pd
data_frame = pd.read_csv('netflix_titles.csv')
data_frame.head(10) # show the first 10 rows
# displays the data frame as a table
<div class="table-wrapper">
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>show_id</th>
<th>type</th>
<th>title</th>
<th>director</th>
<th>cast</th>
<th>country</th>
<th>date_added</th>
<th>release_year</th>
<th>rating</th>
<th>duration</th>
<th>listed_in</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>81145628</td>
<td>Movie</td>
<td>Norm of the North: King Sized Adventure</td>
<td>Richard Finn, Tim Maltby</td>
<td>Alan Marriott, Andrew Toth, Brian Dobson, Cole…</td>
<td>United States, India, South Korea, China</td>
<td>September 9, 2019</td>
<td>2019</td>
<td>TV-PG</td>
<td>90 min</td>
<td>Children & Family Movies, Comedies</td>
<td>Before planning an awesome wedding for his gra…</td>
</tr>
<tr>
<th>1</th>
<td>80117401</td>
<td>Movie</td>
<td>Jandino: Whatever it Takes</td>
<td>NaN</td>
<td>Jandino Asporaat</td>
<td>United Kingdom</td>
<td>September 9, 2016</td>
<td>2016</td>
<td>TV-MA</td>
<td>94 min</td>
<td>Stand-Up Comedy</td>
<td>Jandino Asporaat riffs on the challenges of ra…</td>
</tr>
<tr>
<th>2</th>
<td>70234439</td>
<td>TV Show</td>
<td>Transformers Prime</td>
<td>NaN</td>
<td>Peter Cullen, Sumalee Montano, Frank Welker, J…</td>
<td>United States</td>
<td>September 8, 2018</td>
<td>2013</td>
<td>TV-Y7-FV</td>
<td>1 Season</td>
<td>Kids’ TV</td>
<td>With the help of three human allies, the Autob…</td>
</tr>
<tr>
<th>3</th>
<td>80058654</td>
<td>TV Show</td>
<td>Transformers: Robots in Disguise</td>
<td>NaN</td>
<td>Will Friedle, Darren Criss, Constance Zimmer, …</td>
<td>United States</td>
<td>September 8, 2018</td>
<td>2016</td>
<td>TV-Y7</td>
<td>1 Season</td>
<td>Kids’ TV</td>
<td>When a prison ship crash unleashes hundreds of…</td>
</tr>
<tr>
<th>4</th>
<td>80125979</td>
<td>Movie</td>
<td>#realityhigh</td>
<td>Fernando Lebrija</td>
<td>Nesta Cooper, Kate Walsh, John Michael Higgins…</td>
<td>United States</td>
<td>September 8, 2017</td>
<td>2017</td>
<td>TV-14</td>
<td>99 min</td>
<td>Comedies</td>
<td>When nerdy high schooler Dani finally attracts…</td>
</tr>
<tr>
<th>5</th>
<td>80163890</td>
<td>TV Show</td>
<td>Apaches</td>
<td>NaN</td>
<td>Alberto Ammann, Eloy Azorín, Verónica Echegui,…</td>
<td>Spain</td>
<td>September 8, 2017</td>
<td>2016</td>
<td>TV-MA</td>
<td>1 Season</td>
<td>Crime TV Shows, International TV Shows, Spanis…</td>
<td>A young journalist is forced into a life of cr…</td>
</tr>
<tr>
<th>6</th>
<td>70304989</td>
<td>Movie</td>
<td>Automata</td>
<td>Gabe Ibáñez</td>
<td>Antonio Banderas, Dylan McDermott, Melanie Gri…</td>
<td>Bulgaria, United States, Spain, Canada</td>
<td>September 8, 2017</td>
<td>2014</td>
<td>R</td>
<td>110 min</td>
<td>International Movies, Sci-Fi & Fantasy, Thrillers</td>
<td>In a dystopian future, an insurance adjuster f…</td>
</tr>
<tr>
<th>7</th>
<td>80164077</td>
<td>Movie</td>
<td>Fabrizio Copano: Solo pienso en mi</td>
<td>Rodrigo Toro, Francisco Schultz</td>
<td>Fabrizio Copano</td>
<td>Chile</td>
<td>September 8, 2017</td>
<td>2017</td>
<td>TV-MA</td>
<td>60 min</td>
<td>Stand-Up Comedy</td>
<td>Fabrizio Copano takes audience participation t…</td>
</tr>
<tr>
<th>8</th>
<td>80117902</td>
<td>TV Show</td>
<td>Fire Chasers</td>
<td>NaN</td>
<td>NaN</td>
<td>United States</td>
<td>September 8, 2017</td>
<td>2017</td>
<td>TV-MA</td>
<td>1 Season</td>
<td>Docuseries, Science & Nature TV</td>
<td>As California’s 2016 fire season rages, brave …</td>
</tr>
<tr>
<th>9</th>
<td>70304990</td>
<td>Movie</td>
<td>Good People</td>
<td>Henrik Ruben Genz</td>
<td>James Franco, Kate Hudson, Tom Wilkinson, Omar…</td>
<td>United States, United Kingdom, Denmark, Sweden</td>
<td>September 8, 2017</td>
<td>2014</td>
<td>R</td>
<td>90 min</td>
<td>Action & Adventure, Thrillers</td>
<td>A struggling couple can’t believe their luck w…</td>
</tr>
</tbody>
</table>
</div>
data_frame.info()
# shows information about column data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 6234 non-null int64
1 type 6234 non-null object
2 title 6234 non-null object
3 director 4265 non-null object
4 cast 5664 non-null object
5 country 5758 non-null object
6 date_added 6223 non-null object
7 release_year 6234 non-null int64
8 rating 6224 non-null object
9 duration 6234 non-null object
10 listed_in 6234 non-null object
11 description 6234 non-null object
dtypes: int64(2), object(10)
memory usage: 584.6+ KB
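The non-null counts above show that some columns, such as director, cast, and country, have gaps. A quick way to count the missing values per column is `isna().sum()`. The sketch below uses a small made-up frame (`sample` and its values are assumptions for illustration) rather than the real CSV, so it can run on its own.

```python
import pandas as pd

# Hypothetical miniature of the dataset, with a few missing entries.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", None, None],
    "country": ["India", "Spain", None],
})

# isna() marks missing cells as True; sum() counts them column by column.
missing = sample.isna().sum()
print(missing)
```

Run against the real `data_frame`, the same call should agree with the difference between the total row count and each non-null count reported by `info()`.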
data_frame.shape
# returns the number of rows and columns as a tuple
(6234, 12)
data_frame.describe()
# shows summary statistics for the numeric columns
<div class="table-wrapper">
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>show_id</th>
<th>release_year</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>6.234000e+03</td>
<td>6234.00000</td>
</tr>
<tr>
<th>mean</th>
<td>7.670368e+07</td>
<td>2013.35932</td>
</tr>
<tr>
<th>std</th>
<td>1.094296e+07</td>
<td>8.81162</td>
</tr>
<tr>
<th>min</th>
<td>2.477470e+05</td>
<td>1925.00000</td>
</tr>
<tr>
<th>25%</th>
<td>8.003580e+07</td>
<td>2013.00000</td>
</tr>
<tr>
<th>50%</th>
<td>8.016337e+07</td>
<td>2016.00000</td>
</tr>
<tr>
<th>75%</th>
<td>8.024489e+07</td>
<td>2018.00000</td>
</tr>
<tr>
<th>max</th>
<td>8.123573e+07</td>
<td>2020.00000</td>
</tr>
</tbody>
</table>
</div>
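By default `describe()` only summarises the numeric columns, which is why just show_id and release_year appear above. For a text column such as type, `value_counts()` gives a comparable quick overview. This is a minimal sketch on a made-up frame; `sample` and its rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the 'type' column of the dataset.
sample = pd.DataFrame({"type": ["Movie", "TV Show", "Movie", "Movie"]})

# Count how often each distinct value occurs, most frequent first.
type_counts = sample["type"].value_counts()
print(type_counts)
```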
data_frame['title'].head() # selects a single column and shows its first 5 entries
0 Norm of the North: King Sized Adventure
1 Jandino: Whatever it Takes
2 Transformers Prime
3 Transformers: Robots in Disguise
4 #realityhigh
Name: title, dtype: object
# Filtering Data
data_frame[data_frame['country'] == 'India'].head()
<div class="table-wrapper">
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>show_id</th>
<th>type</th>
<th>title</th>
<th>director</th>
<th>cast</th>
<th>country</th>
<th>date_added</th>
<th>release_year</th>
<th>rating</th>
<th>duration</th>
<th>listed_in</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<th>35</th>
<td>81154455</td>
<td>Movie</td>
<td>Article 15</td>
<td>Anubhav Sinha</td>
<td>Ayushmann Khurrana, Nassar, Manoj Pahwa, Kumud...</td>
<td>India</td>
<td>September 6, 2019</td>
<td>2019</td>
<td>TV-MA</td>
<td>125 min</td>
<td>Dramas, International Movies, Thrillers</td>
<td>The grim realities of caste discrimination com...</td>
</tr>
<tr>
<th>37</th>
<td>81052275</td>
<td>Movie</td>
<td>Ee Nagaraniki Emaindi</td>
<td>Tharun Bhascker</td>
<td>Vishwaksen Naidu, Sushanth Reddy, Abhinav Goma...</td>
<td>India</td>
<td>September 6, 2019</td>
<td>2018</td>
<td>TV-14</td>
<td>133 min</td>
<td>Comedies, International Movies</td>
<td>In Goa and in desperate need of cash, four chi...</td>
</tr>
<tr>
<th>41</th>
<td>70303496</td>
<td>Movie</td>
<td>PK</td>
<td>Rajkumar Hirani</td>
<td>Aamir Khan, Anuskha Sharma, Sanjay Dutt, Saura...</td>
<td>India</td>
<td>September 6, 2018</td>
<td>2014</td>
<td>TV-14</td>
<td>146 min</td>
<td>Comedies, Dramas, International Movies</td>
<td>Aamir Khan teams with director Rajkumar Hirani...</td>
</tr>
<tr>
<th>58</th>
<td>81155784</td>
<td>Movie</td>
<td>Watchman</td>
<td>A. L. Vijay</td>
<td>G.V. Prakash Kumar, Samyuktha Hegde, Suman, Ra...</td>
<td>India</td>
<td>September 4, 2019</td>
<td>2019</td>
<td>TV-14</td>
<td>93 min</td>
<td>Comedies, Dramas, International Movies</td>
<td>Rushing to pay off a loan shark, a young man b...</td>
</tr>
<tr>
<th>99</th>
<td>80225885</td>
<td>TV Show</td>
<td>Bard of Blood</td>
<td>NaN</td>
<td>Emraan Hashmi, Viineet Kumar, Sobhita Dhulipal...</td>
<td>India</td>
<td>September 27, 2019</td>
<td>2019</td>
<td>TV-MA</td>
<td>1 Season</td>
<td>International TV Shows, TV Action & Adventure,...</td>
<td>Years after a disastrous job in Balochistan, a...</td>
</tr>
</tbody>
</table>
</div>
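One thing to keep in mind: the `==` comparison only matches rows whose country is exactly 'India'. A title listed under several countries, like row 0 in the first table ("United States, India, South Korea, China"), is left out. If those should be included, a substring match is one option. The sketch below runs on a small made-up frame (`sample` is an assumption for illustration); `na=False` treats missing countries as non-matches.

```python
import pandas as pd

# Hypothetical rows: one exact match, one multi-country entry, one miss, one gap.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["India", "United States, India", "Spain", None],
})

# Exact comparison: only rows whose country is precisely 'India'.
exact = sample[sample["country"] == "India"]

# Substring match: any row that mentions India; missing values count as False.
any_india = sample[sample["country"].str.contains("India", na=False)]

print(list(exact["title"]), list(any_india["title"]))
```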
# Sorting Data
data_frame.sort_values('release_year', ascending=False).head()
<div class="table-wrapper">
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>show_id</th>
<th>type</th>
<th>title</th>
<th>director</th>
<th>cast</th>
<th>country</th>
<th>date_added</th>
<th>release_year</th>
<th>rating</th>
<th>duration</th>
<th>listed_in</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<th>3467</th>
<td>81011449</td>
<td>TV Show</td>
<td>Medical Police</td>
<td>NaN</td>
<td>Erinn Hayes, Rob Huebel, Malin Akerman, Rob Co...</td>
<td>United States</td>
<td>January 10, 2020</td>
<td>2020</td>
<td>TV-MA</td>
<td>1 Season</td>
<td>Crime TV Shows, TV Action & Adventure, TV Come...</td>
<td>Doctors Owen Maestro and Lola Spratt leave Chi...</td>
</tr>
<tr>
<th>3249</th>
<td>81006825</td>
<td>Movie</td>
<td>All the Freckles in the World</td>
<td>Yibrán Asuad</td>
<td>Hánssel Casillas, Loreto Peralta, Andrea Sutto...</td>
<td>Mexico</td>
<td>January 3, 2020</td>
<td>2020</td>
<td>TV-14</td>
<td>90 min</td>
<td>Comedies, International Movies, Romantic Movies</td>
<td>Thirteen-year-old José Miguel is immune to 199...</td>
</tr>
<tr>
<th>3220</th>
<td>80997687</td>
<td>TV Show</td>
<td>Dracula</td>
<td>NaN</td>
<td>Claes Bang, Dolly Wells, John Heffernan</td>
<td>United Kingdom</td>
<td>January 4, 2020</td>
<td>2020</td>
<td>TV-14</td>
<td>1 Season</td>
<td>British TV Shows, International TV Shows, TV D...</td>
<td>The Count Dracula legend transforms with new t...</td>
</tr>
<tr>
<th>3427</th>
<td>81060049</td>
<td>Movie</td>
<td>Leslie Jones: Time Machine</td>
<td>David Benioff, D.B. Weiss</td>
<td>Leslie Jones</td>
<td>United States</td>
<td>January 14, 2020</td>
<td>2020</td>
<td>TV-MA</td>
<td>66 min</td>
<td>Stand-Up Comedy</td>
<td>From trying to seduce Prince to battling sleep...</td>
</tr>
<tr>
<th>3436</th>
<td>80239306</td>
<td>TV Show</td>
<td>The Healing Powers of Dude</td>
<td>NaN</td>
<td>Jace Chapman, Larisa Oleynik, Tom Everett Scot...</td>
<td>NaN</td>
<td>January 13, 2020</td>
<td>2020</td>
<td>TV-G</td>
<td>1 Season</td>
<td>Kids' TV, TV Comedies, TV Dramas</td>
<td>When an 11-year-old boy with social anxiety di...</td>
</tr>
</tbody>
</table>
</div>
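All five rows above share release_year 2020, and with a single sort key the order among such ties is not guaranteed by default. Passing a second column makes the result deterministic. This is a minimal sketch on made-up rows; `sample` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rows, two of which tie on release_year.
sample = pd.DataFrame({
    "title": ["Medical Police", "Dracula", "PK"],
    "release_year": [2020, 2020, 2014],
})

# Newest first; within the same year, titles in alphabetical order.
ordered = sample.sort_values(
    ["release_year", "title"], ascending=[False, True]
)
print(list(ordered["title"]))
```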
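Here is a great Python data science cheat sheet that lists the commonly used pandas methods and attributes, along with those of other data science libraries.
Cleaning the data
The next step is to clean the data and remove any information that is not needed for the analysis.
Let's consider a sample use case: we want to find the Netflix comedy movies and shows that are suitable for all ages (rated TV-G).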
# Select only the columns relevant to the analysis
df_shows = data_frame[['title', 'rating', 'listed_in']]
# Keep rows whose genres mention comedy ('Comed' matches both 'Comedies' and 'Comedy')
df_comedy_shows = df_shows[df_shows['listed_in'].str.contains('Comed')]
df_comedy_shows.head()
<div class="table-wrapper">
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>title</th>
<th>rating</th>
<th>listed_in</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Norm of the North: King Sized Adventure</td>
<td>TV-PG</td>
<td>Children & Family Movies, Comedies</td>
</tr>
<tr>
<th>1</th>
<td>Jandino: Whatever it Takes</td>
<td>TV-MA</td>
<td>Stand-Up Comedy</td>
</tr>
<tr>
<th>4</th>
<td>#realityhigh</td>
<td>TV-14</td>
<td>Comedies</td>
</tr>
<tr>
<th>7</th>
<td>Fabrizio Copano: Solo pienso en mi</td>
<td>TV-MA</td>
<td>Stand-Up Comedy</td>
</tr>
<tr>
<th>10</th>
<td>Joaquín Reyes: Una y no más</td>
<td>TV-MA</td>
<td>Stand-Up Comedy</td>
</tr>
</tbody>
</table>
</div>
# filter shows for all ages
df_all_ages = df_comedy_shows[df_comedy_shows['rating']=='TV-G']
df_all_ages.head()
<div class="table-wrapper">
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>title</th>
<th>rating</th>
<th>listed_in</th>
</tr>
</thead>
<tbody>
<tr>
<th>1034</th>
<td>Luccas Neto in: Summer Camp</td>
<td>TV-G</td>
<td>Children & Family Movies, Comedies</td>
</tr>
<tr>
<th>1043</th>
<td>A Holiday Engagement</td>
<td>TV-G</td>
<td>Children & Family Movies, Comedies, Romantic M...</td>
</tr>
<tr>
<th>1205</th>
<td>A Fairly Odd Summer</td>
<td>TV-G</td>
<td>Children & Family Movies, Comedies</td>
</tr>
<tr>
<th>1206</th>
<td>Bella and the Bulldogs</td>
<td>TV-G</td>
<td>Kids' TV, TV Comedies</td>
</tr>
<tr>
<th>1211</th>
<td>Jinxed</td>
<td>TV-G</td>
<td>Children & Family Movies, Comedies</td>
</tr>
</tbody>
</table>
</div>
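The two filters above can also be written as a single step by combining boolean masks with `&`. The sketch below shows the idea on a small made-up frame (`sample` and its rows are assumptions for illustration), with `na=False` guarding against missing genre values.

```python
import pandas as pd

# Hypothetical rows covering the match, two near misses, and a missing rating.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": ["TV-G", "TV-MA", "TV-G", None],
    "listed_in": ["Comedies", "Stand-Up Comedy", "Dramas", "TV Comedies"],
})

# One mask per condition; each is a boolean Series aligned with the frame.
is_comedy = sample["listed_in"].str.contains("Comed", na=False)
is_all_ages = sample["rating"] == "TV-G"

# Keep only the rows where both conditions hold.
all_ages_comedies = sample[is_comedy & is_all_ages]
print(list(all_ages_comedies["title"]))
```

Naming the masks keeps each condition readable and makes it easy to reuse them separately.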
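That's all for today's post. Tomorrow I will continue exploring the remaining steps of machine learning and data science, visualising the data with charts and graphs and building a machine learning model.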