Skip to content

Latest commit

 

History

History
110 lines (80 loc) · 6.25 KB

case-insensitive-hashtags.md

File metadata and controls

110 lines (80 loc) · 6.25 KB

🧮 Case insensitive hashtag counting

How to do case insensitive counts of hashtags or other texts...

In digital methods research you may sometimes end up with a list of hashtags where small variations in case and capitalisation are counted as separate items, such as in the following table:

date item value
2021 ESEAHM2021 179
2021 ESEAhm2021 46
2021 eseahm2021 44
2021 ESEA 16
2021 Earth2Air 13
2021 ESEAHM 13
2021 ESEAHeritageMonth 11
2021 NIW2021 10
2021 TheMSGpod 8
2021 MoongateMix 8

This recipe explores simple approaches to changing cases.

🧱 Inputs

  • A dataset of hashtags where case variations are counted separately (as above)

📃 Steps

Converting hashtags to one case

Changing cases with spreadsheets

There are many ways of changing cases of items in datasets with different tools and scripts.

Here's one simple approach using spreadsheets...

  1. Import your dataset into your spreadsheet software
  2. Insert new column in which to add hashtags with their cases all changed
  3. Give the column a meaningful header (e.g. lower if you will change hashtags to lowercase)
  4. In the row next to your first item, type =LOWER() and add the reference for the first item between the brackets.
    • So if you are adding the new cell in C2 and your orginal case sensitive hashtag is B2 then you can type =LOWER(B2) in cell C1.
    • This syntax should work both in Excel as well as in Google Sheets. While lowercase may be easier to read, you could use uppercase instead if you preferred - the main thing is to get all the hashtags in the same case.
  5. Press the enter key to complete and you should see case adjusted hashtag in the new column.
  6. Fill down this formula for the rest of the column

2021-11-09 12 37 29

Changing cases with OpenRefine

Another simple alternative to the approach above you can use OpenRefine...

  1. Download and install OpenRefine. After installation it should open up a browser window.
  2. Click "choose files" and find the CSV file that you have downloaded from TCAT and then click "Next". (CSV stands for "comma-separated values" and this is a simple, widely used format for storing tabular data and moving it between different applications and services.)
  3. Check that the data has been "parsed" correctly and then click "Create Project".
  4. Once the data has loaded you can click the small down arrow next to the column containing your hashtags and click "Edit cells" > "Common transforms" > "To lowercase".
  5. Finally you can click "Export" in the top right hand corner and select "Comma separated value" to re-export the data as a CSV file that you can then work with in a spreadsheet.

There are many other ways to do this (e.g. using pandas).

Recounting the new column

Now you should have a table something like the following:

date item lower value
2021 ESEAHM2021 eseahm2021 179
2021 ESEAhm2021 eseahm2021 46
2021 eseahm2021 eseahm2021 44
2021 ESEA esea 16
2021 Earth2Air earth2air 13
2021 ESEAHM eseahm 13
2021 ESEAHeritageMonth eseaheritagemonth 11
2021 NIW2021 niw2021 10
2021 TheMSGpod themsgpod 8
2021 MoongateMix moongatemix 8

You'll see that the first three hashtags are still separate. How can we combine these?

To recount the new column you can:

  1. Create a pivot table
  2. Select your new column (in this case lower) for the rows
  3. Select the column with the counts (in this case value) for the values
  4. Summarise and sort by the SUM of the column with counts in (in this case value)

2021-11-09 12 59 17

If you have a date column and multiple different years for the hashtags, you can also add the date to the pivot table to ensure that multiple years are not merged together.

2021-11-09 13 07 27

You should now have a table with case insensitive counts of your hashtags, which you can export into a new csv file for further analysis. 🎊

Note the difference in counts between this new table and the tables above (which is why case insensitive counts can matter for analysis)! 😲

date lower SUM of value
2021 eseahm2021 270
2021 esea 18
2021 eseahm 17
2021 earth2air 13
2021 eseaheritagemonth 12
2021 niw2021 10
2021 beseakidlit 9
2021 themsgpod 8
2021 moongatemix 8
2021 londonpodfest 7