Monday 23 July 2018

Howto Transform TSV to CSV, or Just Remove Tabs

Howto Transform TSV to CSV, or Just Remove Tabs


Unfortunately, statistics and machine learning seem to degenerate into a giant mess of getting data from multiple sources, munging it together, transforming it, and formatting the output, even before you can get to the work proper. A common problem is taking tab separate value (tsv) files, perhaps produced as the output of a mysql or postgres query, and turning them into comma separated value (csv) files.
Here’s one method, using sed and pretty standard regexp syntax:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
work earl$ head for_tyler.csv
day   referer ip
2009-07-06    http://powertwitter.me/ 111113333
2009-07-06    http://powertwitter.me/ 111113333
2009-07-06    http://powertwitter.me/ 111113333
2009-07-06    http://powertwitter.me/ 111113333
2009-07-06    http://twitter.com/ 111113333
2009-07-06    http://twitter.com/home 111113333
2009-07-06    http://twitter.com/home 111113333
2009-07-06    http://twitter.com/home 111113333
2009-07-06    http://twitter.com/home 111113333
work earl$
work earl$ head for_tyler.csv | sed "s/\t/,/g"
day,referer,ip
2009-07-06,http://powertwitter.me/,111113333
2009-07-06,http://powertwitter.me/,111113333
2009-07-06,http://powertwitter.me/,111113333
2009-07-06,http://powertwitter.me/,111113333
2009-07-06,http://twitter.com/,111113333
2009-07-06,http://twitter.com/home,111113333
2009-07-06,http://twitter.com/home,111113333
2009-07-06,http://twitter.com/home,111113333
2009-07-06,http://twitter.com/home,111113333
work earl$ 
work earl$
The key bit above is this: "s/\t/,/g". That says turn every tab (\t) into a comma (,). If you instead preferred to just remove tabs from the file period, you could use sed on "s/\t//g".
So, yet another thing I learned today: the version of sed that ships with MacOS, even through 10.5.7, doesn’t support special character sequences. If the above isn’t working for you, and instead is just replacing every t character in the file with a comma, then try this:
1
$ echo "   a       " | sed "s/\t/,/g"
Note that to type those tabs, you’ll have to hit ctrl-v (^V). If the output isn’t ",a,", then you have to type literal tabs in your sed command. The \t works under reasonable versions of linux; you’ll have to use literal tabs under OS X. Bleh.

0 comments:

Post a Comment