20 KiB
Create a machine learning model with Bash
Bash, Tcsh, or Zsh can help you get ready for machine learning. ![bash logo on green background][1]
[Machine learning][2] is a powerful computing capability for predicting or forecasting things that conventional algorithms find challenging. The machine learning journey begins with collecting and preparing data—a lot of it—then it builds mathematical models based on that data. While multiple tools can be used for these tasks, I like to use the [shell][3].
A shell is an interface for performing operations using a defined language. This language can be invoked interactively or scripted. The concept of the shell was introduced in [Unix][4] operating systems in the 1970s. Some of the most popular shells include [Bash][5], [tcsh][6], and [Zsh][7]. They are available for all operating systems, including Linux, macOS, and Windows, which gives them high portability. For this exercise, I'll use Bash.
This article is an introduction to using a shell for data collection and data preparation. Whether you are a data scientist looking for efficient tools or a shell expert looking at using your skills for machine learning, I hope you will find valuable information here.
The example problem in this article is creating a machine learning model to forecast temperatures for US states. It uses shell commands and scripts to do the following data collection and data preparation steps:
- Download data
- Extract the necessary fields
- Aggregate data
- Make time series
- Create the train, test, and validate data sets
You may be asking why you should do this with shell, when you can do all of it in a machine learning programming language such as [Python][8]. This is a good question. If data processing is performed with an easy, friendly, rich technology like a shell, a data scientist focuses only on machine learning modeling and not the details of a language.
Prerequisites
First, you need to have a shell interpreter installed. If you use Linux or macOS, it will already be installed, and you may already be familiar with it. If you use Windows, try [MinGW][9] or [Cygwin][10].
For more information, see:
- [Bash tutorials][11] here on opensource.com
- The official [shell scripting tutorial][12] by Steve Parker, the creator of Bourne shell
- The [Bash Guide for Beginners][13] by the Linux Documentation Project
- If you need help with a specific command, type
<commandname> --help
in the shell for help; for example:ls --help
.
Get started
Now that your shell is set up, you can start preparing data for the machine learning temperature-prediction problem.
1. Download data
The data for this tutorial comes from the US National Oceanic and Atmospheric Administration (NOAA). You will train your model using the last 10 complete years of data. The data source is at https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/, and the data is in .csv format and gzipped.
Download and unzip the data using a [shell script][14]. Use your favorite text editor to create a file named download.sh
and paste in the code below. The comments in the code explain what the commands do:
#!/bin/sh
# This is called hashbang. It identifies the executor used to run this file.
# In this case, the script is executed by shell itself.
# If not specified, a program to execute the script must be specified.
# With hashbang: ./download.sh; Without hashbang: sh ./download.sh;
FROM_YEAR=2010
TO_YEAR=2019
year=$FROM_YEAR
# For all years one by one starting from FROM_YEAR=2010 upto TO_YEAR=2019
while [ $year -le $TO_YEAR ]
do
# show the year being downloaded now
echo $year
# Download
wget <https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by\_year/${year}.csv.gz>
# Unzip
gzip -d ${year}.csv.gz
# Move to next year by incrementing
year=$(($year+1))
done
Notes:
- If you are behind a proxy server, consult Mark Grennan's [how-to][15], and use: [code] export http_proxy=http://username:password@proxyhost:port/ export https_proxy=https://username:password@proxyhost:port/
* Make sure all standard commands are already in your PATH (such as `/bin` or `/usr/bin`). If not, [set your PATH][16].
* [Wget][17] is a utility for connecting to web servers from the command line. If Wget is not installed on your system, [download it][18].
* Make sure you have [gzip][19], a utility used for compression and decompression.
Run this script to download, extract, and make 10 years' worth of data available as CSVs:
$ ./download.sh 2010 --2020-10-30 19:10:47-- https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by\_year/2010.csv.gz Resolving www1.ncdc.noaa.gov (www1.ncdc.noaa.gov)... 205.167.25.171, 205.167.25.172, 205.167.25.178, ... Connecting to www1.ncdc.noaa.gov (www1.ncdc.noaa.gov)|205.167.25.171|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 170466817 (163M) [application/gzip] Saving to: '2010.csv.gz'
0K .......... .......... .......... .......... .......... 0% 69.4K 39m57s 50K .......... .......... .......... .......... .......... 0% 202K 26m49s 100K .......... .......... .......... .......... .......... 0% 1.08M 18m42s
...
The [ls][20] command lists the contents of a folder. Use `ls 20*.csv` to list all your files with names beginning with 20 and ending with .csv.
$ ls 20*.csv 2010.csv 2011.csv 2012.csv 2013.csv 2014.csv 2015.csv 2016.csv 2017.csv 2018.csv 2019.csv
### 2\. Extract average temperatures
Extract the TAVG (average temperature) data from the CSVs for US regions:
**extract_tavg_us.sh**
#!/bin/sh
For each file with name that starts with "20" and ens with ".csv"
for csv_file in ls 20*.csv
do
# Message that says file name $csv_file is extracted to file TAVG_US_$csv_file
# Example: 2010.csv extracted to TAVG_US_2010.csv
echo "$csv_file -> TAVG_US_$csv_file"
# grep "TAVG" $csv_file: Extract lines in file with text "TAVG"
# |: pipe
# grep "^US": From those extract lines that begin with text "US"
# > TAVG_US_$csv_file: Save xtracted lines to file TAVG_US_$csv_file
grep "TAVG" $csv_file | grep "^US" > TAVG_US_$csv_file
done
This script:
$ ./extract_tavg_us.sh 2010.csv -> TAVG_US_2010.csv ... 2019.csv -> TAVG_US_2019.csv
creates these files:
$ ls TAVG_US*.csv TAVG_US_2010.csv TAVG_US_2011.csv TAVG_US_2012.csv TAVG_US_2013.csv TAVG_US_2014.csv TAVG_US_2015.csv TAVG_US_2016.csv TAVG_US_2017.csv TAVG_US_2018.csv TAVG_US_2019.csv
Here are the first few lines for `TAVG_US_2010.csv`:
$ head TAVG_US_2010.csv USR0000AALC,20100101,TAVG,-220,,,U, USR0000AALP,20100101,TAVG,-9,,,U, USR0000ABAN,20100101,TAVG,12,,,U, USR0000ABCA,20100101,TAVG,16,,,U, USR0000ABCK,20100101,TAVG,-309,,,U, USR0000ABER,20100101,TAVG,-81,,,U, USR0000ABEV,20100101,TAVG,-360,,,U, USR0000ABEN,20100101,TAVG,-224,,,U, USR0000ABNS,20100101,TAVG,89,,,U, USR0000ABLA,20100101,TAVG,59,,,U,
The [head][21] command is a utility for displaying the first several lines (by default, 10 lines) of a file.
The data has more information than you need. Limit the number of columns by eliminating column 3 (since all the data is average temperature) and column 5 onward. In other words, keep columns 1 (climate station), 2 (date), and 4 (temperature recorded).
**key_columns.sh**
#!/bin/sh
For each file with name that starts with "TAVG_US_" and ens with ".csv"
for csv_file in ls TAVG_US_*.csv
do
echo "Exractiing columns $csv_file"
# cat $csv_file: 'cat' is to con'cat'enate files - here used to show one year csv file
# |: pipe
# cut -d',' -f1,2,4: Cut columns 1,2,4 with , delimitor
# > $csv_file.cut: Save to temporary file
| > $csv_file.cut:
cat $csv_file | cut -d',' -f1,2,4 > $csv_file.cut
# mv $csv_file.cut $csv_file: Rename temporary file to original file
mv $csv_file.cut $csv_file
# File is processed and saved back into the same
# There are other ways to do this
# Using intermediate file is the most reliable method.
done
Run the script:
$ ./key_columns.sh Extracting columns TAVG_US_2010.csv ... Extracting columns TAVG_US_2019.csv
The first few lines of `TAVG_US_2010.csv` with the unneeded data removed are:
$ head TAVG_US_2010.csv USR0000AALC,20100101,-220 USR0000AALP,20100101,-9 USR0000ABAN,20100101,12 USR0000ABCA,20100101,16 USR0000ABCK,20100101,-309 USR0000ABER,20100101,-81 USR0000ABEV,20100101,-360 USR0000ABEN,20100101,-224 USR0000ABNS,20100101,89 USR0000ABLA,20100101,59
Dates are in string form (YMD). To train your model correctly, your algorithms need to recognize date fields in the comma-separated Y,M,D form (For example, `20100101` becomes `2010,01,01`). You can convert them with the [sed][22] utility.
**date_format.sh**
for csv_file in ls TAVG_*.csv
do
echo Date formatting $csv_file
# This inserts , after year
sed -i 's/,..../&,/' $csv_file
# This inserts , after month
sed -i 's/,....,../&,/' $csv_file
done
Run the script:
$ ./date_format.sh Date formatting TAVG_US_2010.csv ... Date formatting TAVG_US_2019.csv
The first few lines of `TAVG_US_2010.csv` with the comma-separated date format are:
$ head TAVG_US_2010.csv USR0000AALC,2010,01,01,-220 USR0000AALP,2010,01,01,-9 USR0000ABAN,2010,01,01,12 USR0000ABCA,2010,01,01,16 USR0000ABCK,2010,01,01,-309 USR0000ABER,2010,01,01,-81 USR0000ABEV,2010,01,01,-360 USR0000ABEN,2010,01,01,-224 USR0000ABNS,2010,01,01,89 USR0000ABLA,2010,01,01,59
### 3\. Aggregate states' average temperature data
The weather data comes from climate stations located in US cities, but you want to forecast whole states' temperatures. To convert the climate-station data to state data, first, map climate stations to their states.
Download the list of climate stations using wget:
$ wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
Extract the US stations with the [grep][23] utility to find US listings. The following command searches for lines that begin with the text `"US`." The `>` is a [redirection][24] that writes output to a file—in this case, to a file named `us_stations.txt`:
$ grep "^US" ghcnd-stations.txt > us_stations.txt
This file was created for pretty print, so the column separators are inconsistent:
$ head us_stations.txt US009052008 43.7333 -96.6333 482.0 SD SIOUX FALLS (ENVIRON. CANADA) US10RMHS145 40.5268 -105.1113 1569.1 CO RMHS 1.6 SSW US10adam001 40.5680 -98.5069 598.0 NE JUNIATA 1.5 S ...
Make them consistent by using [cat][25] to print the file, using [tr][26] to squeeze repeats and output to a temp file, and renaming the temp file back to the original—all in one line:
$ cat us_stations.txt | tr -s ' ' > us_stations.txt.tmp; cp us_stations.txt.tmp us_stations.txt;
The first lines of the command's output:
$ head us_stations.txt US009052008 43.7333 -96.6333 482.0 SD SIOUX FALLS (ENVIRON. CANADA) US10RMHS145 40.5268 -105.1113 1569.1 CO RMHS 1.6 SSW US10adam001 40.5680 -98.5069 598.0 NE JUNIATA 1.5 S ...
This contains a lot of info—GPS coordinates and such—but you only need the station code and state. Use [cut][27]:
$ cut -d' ' -f1,5 us_stations.txt > us_stations.txt.tmp; mv us_stations.txt.tmp us_stations.txt;
The first lines of the command's output:
$ head us_stations.txt US009052008 SD US10RMHS145 CO US10adam001 NE US10adam002 NE ...
Make this a CSV and change the spaces to comma separators using sed:
$ sed -i s/' '/,/g us_stations.txt
The first lines of the command's output:
$ head us_stations.txt US009052008,SD US10RMHS145,CO US10adam001,NE US10adam002,NE ...
Although you used several commands for these tasks, it is possible to perform all the steps in one run. Try it yourself.
Now, replace the station codes with their state locations by using [AWK][28], which is functionally high performant for large data processing.
**station_to_state_data.sh**
PATTERN_FILE=us_stations.txt
for DATA_FILE in ls TAVG_US_*.csv
do
echo ${DATA_FILE}
awk -F,
'FNR==NR { x[$1]=$2; next; } { $1=x[$1]; print $0 }'
OFS=,
${PATTERN_FILE} ${DATA_FILE} > ${DATA_FILE}.tmp
mv ${DATA_FILE}.tmp ${DATA_FILE} done
Here is what these parameters mean:
`-F,` | Field separator is `,`
---|---
`FNR` | Line number in each file
`NR` | Line number in both files together
`FNR==NR` | Is TRUE only in the first file `${PATTERN_FILE}`
`{ x[$1]=$2; next; }` | If `FNR==NR` is TRUE (for all lines in `$PATTERN_FILE` only)
`- x` | Variable to store `station=state` map
`- x[$1]=$2` | Adds data of `station=state` to map
`- $1` | First column in first file (station codes)
`- $2` | Second column in first file (state codes)
`- x` | Map of all stations e.g., `x[US009052008]=SD`, `x[US10RMHS145]=CO`, ..., `x[USW00096409]=AK`
`- next` | Go to next line matching `FNR==NR` (essentially, this creates a map of all stations-states from the `${PATTERN_FILE}`
`{ $1=x[$1]; print $0 }` | If `FNR==NR` is FALSE (for all lines `in $DATA_FILE` only)
`- $1=x[$1]` | Replace first field with `x[$1]`; essentially, replace station code with state code
`- print $0` | Print all columns (including replaced `$1`)
`OFS=,` | Output fields separator is `,`
The CSV with station codes:
$ head TAVG_US_2010.csv USR0000AALC,2010,01,01,-220 USR0000AALP,2010,01,01,-9 USR0000ABAN,2010,01,01,12 USR0000ABCA,2010,01,01,16 USR0000ABCK,2010,01,01,-309 USR0000ABER,2010,01,01,-81 USR0000ABEV,2010,01,01,-360 USR0000ABEN,2010,01,01,-224 USR0000ABNS,2010,01,01,89 USR0000ABLA,2010,01,01,59
Run the command:
$ ./station_to_state_data.sh TAVG_US_2010.csv ... TAVG_US_2019.csv
Stations are now mapped to states:
$ head TAVG_US_2010.csv AK,2010,01,01,-220 AZ,2010,01,01,-9 AL,2010,01,01,12 AK,2010,01,01,16 AK,2010,01,01,-309 AK,2010,01,01,-81 AK,2010,01,01,-360 AK,2010,01,01,-224 AZ,2010,01,01,59 AK,2010,01,01,-68
Every state has several temperature readings for each day, so you need to calculate the average of each state's readings for a day. Use AWK for text processing, [sort][29] to ensure the final results are in a logical order, and [rm][30] to delete the temporary file after processing.
**station_to_state_data.sh**
PATTERN_FILE=us_stations.txt
for DATA_FILE in ls TAVG_US_*.csv
do
echo ${DATA_FILE}
awk -F,
'FNR==NR { x[$1]=$2; next; } { $1=x[$1]; print $0 }'
OFS=,
${PATTERN_FILE} ${DATA_FILE} > ${DATA_FILE}.tmp
mv ${DATA_FILE}.tmp ${DATA_FILE} done
Here is what the AWK parameters mean:
`FILE=$DATA_FILE` | CSV file processed as `FILE`
---|---
`-F,` | Field separator is `,`
`state_day_sum[$1 "," $2 "," $3 "," $4] = $5 state_day_sum[$1 "," $2 "," $3 "," $4] + $5` | Sum of temperature (`$5`) for the state `($1`) on year (`$2`), month (`$3`), day (`$4`)
`state_day_num[$1 "," $2 "," $3 "," $4] = $5 state_day_num[$1 "," $2 "," $3 "," $4] + 1` | Number of temperature readings for the state (`$1`) on year (`$2`), month (`$3`), day (`$4`)
`END` | In the end, after collecting sum and number of readings for all states, years, months, days, calculate averages
`for (state_day_key in state_day_sum)` | For each state-year-month-day
`print state_day_key "," state_day_sum[state_day_key]/state_day_num[state_day_key]` | Print state,year,month,day,average
`OFS=,` | Output fields separator is `,`
`$DATA_FILE` | Input file (all files with name starting with `TAVG_US_` and ending with `.csv`, one by one)
`> STATE_DAY_${DATA_FILE}.tmp` | Save result to a temporary file
Run the script:
$ ./TAVG_avg.sh TAVG_US_2010.csv TAVG_US_2011.csv TAVG_US_2012.csv TAVG_US_2013.csv TAVG_US_2014.csv TAVG_US_2015.csv TAVG_US_2016.csv TAVG_US_2017.csv TAVG_US_2018.csv TAVG_US_2019.csv
These files are created:
$ ls STATE_DAY_TAVG_US_20*.csv STATE_DAY_TAVG_US_2010.csv STATE_DAY_TAVG_US_2015.csv STATE_DAY_TAVG_US_2011.csv STATE_DAY_TAVG_US_2016.csv STATE_DAY_TAVG_US_2012.csv STATE_DAY_TAVG_US_2017.csv STATE_DAY_TAVG_US_2013.csv STATE_DAY_TAVG_US_2018.csv STATE_DAY_TAVG_US_2014.csv STATE_DAY_TAVG_US_2019.csv
See one year of data for all states ([less][31] is a utility to see output a page at a time):
$ less STATE_DAY_TAVG_US_2010.csv AK,2010,01,01,-181.934 ... AK,2010,01,31,-101.068 AK,2010,02,01,-107.11 ... AK,2010,02,28,-138.834 ... WY,2010,01,01,-43.5625 ... WY,2010,12,31,-215.583
Merge all the data files into one:
$ cat STATE_DAY_TAVG_US_20*.csv > TAVG_US_2010-2019.csv
You now have one file, with all states, for all years:
$ cat TAVG_US_2010-2019.csv AK,2010,01,01,-181.934 ... WY,2018,12,31,-167.421 AK,2019,01,01,-32.3386 ... WY,2019,12,30,-131.028 WY,2019,12,31,-79.8704
## 4\. Make time-series data
A problem like this is fittingly addressed with a time-series model such as long short-term memory ([LSTM][32]), which is a recurring neural network ([RNN][33]). This input data is organized into time slices; consider 20 days to be one slice.
This is a one-time slice (as in `STATE_DAY_TAVG_US_2010.csv`):
X (input – 20 weeks): AK,2010,01,01,-181.934 AK,2010,01,02,-199.531 ... AK,2010,01,20,-157.273
y (21st week, prediction for these 20 weeks): AK,2010,01,21,-165.31
This time slice is represented as (temperature values where the first 20 weeks are X, and 21 is y):
AK, -181.934,-199.531, ... , -157.273,-165.3
The slices are time-contiguous. For example, the end of 2010 continues into 2011:
AK,2010,12,22,-209.92 ... AK,2010,12,31,-79.8523 AK,2011,01,01,-59.5658 ... AK,2011,01,10,-100.623
Which results in the prediction:
AK,2011,01,11,-106.851
This time slice is taken as:
AK, -209.92, ... ,-79.8523,-59.5658, ... ,-100.623,-106.851
and so on, for all states, years, months, and dates. For more explanation, see this tutorial on [time-series forecasting][34].
Write a script to create time slices:
**timeslices.sh**
#!/bin/sh
TIME_SLICE_PERIOD=20
file=TAVG_US_2010-2019.csv
For each state in file
for state in cut -d',' -f1 $file | sort | uniq
do
# Get all temperature values for the state
state_tavgs=grep $state $file | cut -d',' -f5
# How many time slices will this result in?
# mber of temperatures recorded minus size of one timeslice
num_slices=echo $state_tavgs | wc -w
num_slices=$((${num_slices} - ${TIME_SLICE_PERIOD}))
# Initialize
slice_start=1; num_slice=0;
# For each timeslice
while [ $num_slice -lt $num_slices ]
do
# One timeslice is from slice_start to slice_end
slice_end=$(($slice_start + $TIME_SLICE_PERIOD - 1))
# X (1-20)
sliceX="$slice_start-$slice_end"
# y (21)
slicey=$(($slice_end + 1))
# Print state and timeslice temperature values (column 1-20 and 21)
echo $state echo $state_tavgs | cut -d' ' -f$sliceX,$slicey
# Increment
slice_start=$(($slice_start + 1)); num_slice=$(($num_slice + 1));
done
done
Run the script. It uses spaces as column separators; make them commas with sed:
$ ./timeslices.sh > TIMESLICE_TAVG_US_2010-2019.csv; sed -i s/' '/,/g TIME_VARIANT_TAVG_US_2010-2019.csv
Here are the first few lines and the last few lines of the output .csv:
$ head -3 TIME_VARIANT_TAVG_US_2009-2019.csv AK,-271.271,-290.057,-300.324,-277.603,-270.36,-293.152,-292.829,-270.413,-256.674,-241.546,-217.757,-158.379,-102.585,-24.9517,-1.7973,15.9597,-5.78231,-33.932,-44.7655,-92.5694,-123.338 AK,-290.057,-300.324,-277.603,-270.36,-293.152,-292.829,-270.413,-256.674,-241.546,-217.757,-158.379,-102.585,-24.9517,-1.7973,15.9597,-5.78231,-33.932,-44.7655,-92.5694,-123.338,-130.829 AK,-300.324,-277.603,-270.36,-293.152,-292.829,-270.413,-256.674,-241.546,-217.757,-158.379,-102.585,-24.9517,-1.7973,15.9597,-5.78231,-33.932,-44.7655,-92.5694,-123.338,-130.829,-123.979
$ tail -3 TIME_VARIANT_TAVG_US_2009-2019.csv WY,-76.9167,-66.2315,-45.1944,-27.75,-55.3426,-81.5556,-124.769,-137.556,-90.213,-54.1389,-55.9907,-30.9167,-9.59813,7.86916,-1.09259,-13.9722,-47.5648,-83.5234,-98.2963,-124.694,-142.898 WY,-66.2315,-45.1944,-27.75,-55.3426,-81.5556,-124.769,-137.556,-90.213,-54.1389,-55.9907,-30.91