diff --git a/02_RProgramming/NOTUSED/grep/Regular Expressions - grep.pdf b/02_RProgramming/NOTUSED/grep/Regular Expressions - grep.pdf deleted file mode 100644 index f711b11d..00000000 Binary files a/02_RProgramming/NOTUSED/grep/Regular Expressions - grep.pdf and /dev/null differ diff --git a/02_RProgramming/NOTUSED/grep/index.Rmd b/02_RProgramming/NOTUSED/grep/index.Rmd deleted file mode 100644 index bd5eb147..00000000 --- a/02_RProgramming/NOTUSED/grep/index.Rmd +++ /dev/null @@ -1,337 +0,0 @@ ---- -title : Regular Expressions - grep -subtitle : Computing for Data Analysis -author : Roger Peng, Associate Professor -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../libraries - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - -## Regular Expression Functions - -The primary R functions for dealing with regular expressions are -- `grep`, `grepl`: Search for matches of a regular expression/pattern in a character vector; either return the indices into the character vector that match, the strings that happen to match, or a TRUE/FALSE vector indicating which elements match -- `regexpr`, `gregexpr`: Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match -- `sub`, `gsub`: Search a character vector for regular expression matches and replace that match with another string -- `regexec`: Easier to explain through demonstration. - ---- - -## grep - -Here is an excerpt of the Baltimore City homicides dataset: - -```r -> homicides <- readLines("homicides.txt") -> homicides[1] -[1] "39.311024, -76.674227, iconHomicideShooting, ’p2’, ’
Leon -Nelson
3400 Clifton Ave.
Baltimore, MD -21216
black male, 17 years old
-
Found on January 1, 2007
Victim died at Shock -Trauma
Cause: shooting
’" - -> homicides[1000] -[1] "39.33626300000, -76.55553990000, icon_homicide_shooting, ’p1200’,... -``` - -How can I find the records for all the victims of shootings (as opposed to other causes)? - ---- - -## grep - -```r -> length(grep("iconHomicideShooting", homicides)) -[1] 228 -> length(grep("iconHomicideShooting|icon_homicide_shooting", homicides)) -[1] 1003 -> length(grep("Cause: shooting", homicides)) -[1] 228 -> length(grep("Cause: [Ss]hooting", homicides)) -[1] 1003 -> length(grep("[Ss]hooting", homicides)) -[1] 1005 -``` - ---- - -## grep - -```r -> i <- grep("[cC]ause: [Ss]hooting", homicides) -> j <- grep("[Ss]hooting", homicides) -> str(i) - int [1:1003] 1 2 6 7 8 9 10 11 12 13 ... -> str(j) - int [1:1005] 1 2 6 7 8 9 10 11 12 13 ... -> setdiff(i, j) -integer(0) -> setdiff(j, i) -[1] 318 859 -``` - ---- - -## grep - -```r -> homicides[859] -[1] "39.33743900000, -76.66316500000, icon_homicide_bluntforce, -’p914’, ’
Steven Harris -
4200 Pimlico Road
Baltimore, MD 21215 -
Race: Black
Gender: male
Age: 38 years old
-
Found on July 29, 2010
Victim died at Scene
-
Cause: Blunt Force

Harris was -found dead July 22 and ruled a shooting victim; an autopsy -subsequently showed that he had not been shot,...

’" -``` - ---- - -## grep - -By default, `grep` returns the indices into the character vector where the regex pattern matches. - -```r -> grep("^New", state.name) -[1] 29 30 31 32 -Setting value = TRUE returns the actual elements of the character vector that match. > grep("^New", state.name, value = TRUE) -[1] "New Hampshire" "New Jersey" "New Mexico" "New York" -grepl returns a logical vector indicating which element matches. -> grepl("^New", state.name) - [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALS -[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALS -[25] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALS -[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALS -[49] FALSE FALSE - -``` - ---- - -## regexpr - -Some limitations of `grep` -- The `grep` function tells you which strings in a character vector match a certain pattern but it doesn’t tell you exactly where the match occurs or what the match is (for a more complicated regex). -- The `regexpr` function gives you the index into each string where the match begins and the length of the match for that string. -- `regexpr` only gives you the first match of the string (reading left to right). `gregexpr` will give you all of the matches in a given string. - ---- - -## regexpr - -How can we find the date of the homicide? - -```r -> homicides[1] -[1] "39.311024, -76.674227, iconHomicideShooting, ’p2’, ’
Leon -Nelson
3400 Clifton Ave.
Baltimore, -MD 21216
black male, 17 years old
-
Found on January 1, 2007
Victim died at Shock -Trauma
Cause: shooting
’" -``` - -Can we just ’grep’ on “Found”? - ---- - -## regexpr - -The word ’found’ may be found elsewhere in the entry. - -```r -> homicides[954] -[1] "39.30677400000, -76.59891100000, icon_homicide_shooting, ’p816’, -’
1400 N Caroline St
Baltimore, MD 21213
-
Race: Black
Gender: male
Age: 29 years old
-
Found on March 3, 2010
Victim died at Scene
-
Cause: Shooting

Wheeler\\’s body -was found on the grounds of Dr. Bernard Harris Sr. Elementary -School

’" -``` - ---- - -## regexpr - -Let’s use the pattern -'
[F|f]ound(.*)
' -What does this look for? - -```r -> regexpr("
[F|f]ound(.*)
", homicides[1:10]) - [1] 177 178 188 189 178 182 178 187 182 183 -attr(,"match.length") - [1] 93 86 89 90 89 84 85 84 88 84 -attr(,"useBytes") -[1] TRUE -> substr(homicides[1], 177, 177 + 93 - 1) -[1] "
Found on January 1, 2007
Victim died at Shock - Trauma
Cause: shooting
" -``` - ---- - -## regexpr - -The previous pattern was too greedy and matched too much of the string. We need to use the ? metacharacter to make the regex “lazy”. - -```r -> regexpr("
[F|f]ound(.*?)
", homicides[1:10]) - [1] 177 178 188 189 178 182 178 187 182 183 -attr(,"match.length") - [1] 33 33 33 33 33 33 33 33 33 33 -attr(,"useBytes") -[1] TRUE -> substr(homicides[1], 177, 177 + 33 - 1) -[1] "
Found on January 1, 2007
" -``` - ---- - -## regmatches - -One handy function is regmatches which extracts the matches in the strings for you without you having to use `substr`. - -```r -> r <- regexpr("
[F|f]ound(.*?)
", homicides[1:5]) -> regmatches(homicides[1:5], r) -[1] "
Found on January 1, 2007
" "
Found on January 2, 2007
" -[3] "
Found on January 2, 2007
" "
Found on January 3, 2007
" -[5] "
Found on January 5, 2007
" -``` - ---- - -## sub/gsub - -Sometimes we need to clean things up or modify strings by matching a pattern and replacing it with something else. For example, how can we extract the data from this string? - -```r -> x <- substr(homicides[1], 177, 177 + 33 - 1) -> x -[1] "
Found on January 1, 2007
" -``` - -We want to strip out the stuff surrounding the “January 1, 2007” piece. - -```r -> sub("
[F|f]ound on |
", "", x) -[1] "January 1, 2007" -> gsub("
[F|f]ound on |
", "", x) -[1] "January 1, 2007" -``` - ---- - -## sub/gsub - -sub/gsub can take vector arguments - -```r -> r <- regexpr("
[F|f]ound(.*?)
", homicides[1:5]) -> m <- regmatches(homicides[1:5], r) ->m -[1] "
Found on January 1, 2007
" "
Found on January 2, 2007
" -[3] "
Found on January 2, 2007
" "
Found on January 3, 2007
" -[5] "
Found on January 5, 2007
" -> gsub("
[F|f]ound on |
", "", m) -[1] "January 1, 2007" "January 2, 2007" "January 2, 2007" "January 3, 2007" -[5] "January 5, 2007" -> as.Date(d, "%B %d, %Y") -[1] "2007-01-01" "2007-01-02" "2007-01-02" "2007-01-03" "2007-01-05" -``` - ---- - -## regexec - -The `regexec` function works like regexpr except it gives you the indices for parenthesized sub-expressions. - -```r -> regexec("
[F|f]ound on (.*?)
", homicides[1]) -[[1]] -[1] 177 190 -attr(,"match.length") -[1] 33 15 - -> regexec("
[F|f]ound on .*?
", homicides[1]) -[[1]] -[1] 177 -attr(,"match.length") -[1] 33 -``` - ---- - -## regexec - -Now we can extract the string in the parenthesized sub-expression. - -```r -> regexec("
[F|f]ound on (.*?)
", homicides[1]) -[[1]] -[1] 177 190 -attr(,"match.length") -[1] 33 15 - -> substr(homicides[1], 177, 177 + 33 - 1) -[1] "
Found on January 1, 2007
" - -> substr(homicides[1], 190, 190 + 15 - 1) -[1] "January 1, 2007" -``` - ---- - -## regexec - -Even easier with the regmatches function. - -```r -> r <- regexec("
[F|f]ound on (.*?)
", homicides[1:2]) -> regmatches(homicides[1:2], r) -[[1]] -[1] "
Found on January 1, 2007
" "January 1, 2007" - -[[2]] -[1] "
Found on January 2, 2007
" "January 2, 2007" -``` - ---- - -## regexec - -Let’s make a plot of monthly homicide counts - -```r -> r <- regexec("
[F|f]ound on (.*?)
", homicides) -> m <- regmatches(homicides, r) -> dates <- sapply(m, function(x) x[2]) -> dates <- as.Date(dates, "%B %d, %Y") -> hist(dates, "month", freq = TRUE) -``` - ---- - -## regexec - - - ---- - -## Summary - -The primary R functions for dealing with regular expressions are -- `grep`, `grepl`: Search for matches of a regular expression/pattern in a character vector -- `regexpr`, `gregexpr`: Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction with `regmatches` -- `sub`, `gsub`: Search a character vector for regular expression matches and replace that match with another string -- `regexec`: Gives you indices of parethensized sub-expressions. \ No newline at end of file diff --git a/02_RProgramming/NOTUSED/grep/index.html b/02_RProgramming/NOTUSED/grep/index.html deleted file mode 100644 index 2d0a8f7b..00000000 --- a/02_RProgramming/NOTUSED/grep/index.html +++ /dev/null @@ -1,494 +0,0 @@ - - - - Regular Expressions - grep - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Regular Expressions - grep

-

Computing for Data Analysis

-

Roger Peng, Associate Professor
Johns Hopkins Bloomberg School of Public Health

-
-
- - - -
-

Regular Expression Functions

-
-
-

The primary R functions for dealing with regular expressions are

- -
    -
  • grep, grepl: Search for matches of a regular expression/pattern in a character vector; either return the indices into the character vector that match, the strings that happen to match, or a TRUE/FALSE vector indicating which elements match
  • -
  • regexpr, gregexpr: Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match
  • -
  • sub, gsub: Search a character vector for regular expression matches and replace that match with another string
  • -
  • regexec: Easier to explain through demonstration.
  • -
- -
- -
- - -
-

grep

-
-
-

Here is an excerpt of the Baltimore City homicides dataset:

- -
> homicides <- readLines("homicides.txt")
-> homicides[1]
-[1] "39.311024, -76.674227, iconHomicideShooting, ’p2’, ’<dl><dt>Leon
-Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD
-21216</dd><dd>black male, 17 years old</dd>
-<dd>Found on January 1, 2007</dd><dd>Victim died at Shock
-Trauma</dd><dd>Cause: shooting</dd></dl>’"
-
-> homicides[1000]
-[1] "39.33626300000, -76.55553990000, icon_homicide_shooting, ’p1200’,...
-
- -

How can I find the records for all the victims of shootings (as opposed to other causes)?

- -
- -
- - -
-

grep

-
-
-
> length(grep("iconHomicideShooting", homicides))
-[1] 228
-> length(grep("iconHomicideShooting|icon_homicide_shooting", homicides))
-[1] 1003
-> length(grep("Cause: shooting", homicides))
-[1] 228
-> length(grep("Cause: [Ss]hooting", homicides))
-[1] 1003
-> length(grep("[Ss]hooting", homicides))
-[1] 1005
-
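The counts above differ because the two record formats spell the icon name and the cause differently, and because a bare `[Ss]hooting` also hits free-text notes. A minimal sketch of the same effect, assuming a made-up `records` vector (it is hypothetical and not taken from `homicides.txt`):

```r
# Hypothetical records mimicking the two formats in the slides; not real data.
records <- c(
  "iconHomicideShooting, Cause: shooting",
  "icon_homicide_shooting, Cause: Shooting",
  "icon_homicide_bluntforce, Cause: Blunt Force, ruled a shooting victim",
  "icon_homicide_stabbing, Cause: Stabbing"
)

# A plain literal only finds the lower-case spelling.
n_literal <- length(grep("Cause: shooting", records))

# A character class accepts either capitalisation.
n_class <- length(grep("Cause: [Ss]hooting", records))

# Dropping the "Cause: " prefix also picks up free-text mentions.
n_loose <- length(grep("[Ss]hooting", records))

# Alternation matches either spelling of the icon name.
n_icon <- length(grep("iconHomicideShooting|icon_homicide_shooting", records))
```

On this toy vector the four counts are 1, 2, 3 and 2: the third record is the kind of false positive that `setdiff` helps to track down.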
- -
- -
- - -
-

grep

-
-
-
> i <- grep("[cC]ause: [Ss]hooting", homicides)
-> j <- grep("[Ss]hooting", homicides)
-> str(i)
- int [1:1003] 1 2 6 7 8 9 10 11 12 13 ...
-> str(j)
- int [1:1005] 1 2 6 7 8 9 10 11 12 13 ...
-> setdiff(i, j)
-integer(0)
-> setdiff(j, i)
-[1] 318 859
-
- -
- -
- - -
-

grep

-
-
-
> homicides[859]
-[1] "39.33743900000, -76.66316500000, icon_homicide_bluntforce,
-’p914’, ’<dl><dt><a href=\"http://essentials.baltimoresun.com/
-micro_sun/homicides/victim/914/steven-harris\">Steven Harris</a>
-</dt><dd class=\"address\">4200 Pimlico Road<br />Baltimore, MD 21215
-</dd><dd>Race: Black<br />Gender: male<br />Age: 38 years old</dd>
-<dd>Found on July 29, 2010</dd><dd>Victim died at Scene</dd>
-<dd>Cause: Blunt Force</dd><dd class=\"popup-note\"><p>Harris was
-found dead July 22 and ruled a shooting victim; an autopsy
-subsequently showed that he had not been shot,...</dd></dl>’"
-
- -
- -
- - -
-

grep

-
-
-

By default, grep returns the indices into the character vector where the regex pattern matches.

- -
> grep("^New", state.name)
-[1] 29 30 31 32
-# Setting value = TRUE returns the actual elements of the character vector that match.
-> grep("^New", state.name, value = TRUE)
-[1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"
-# grepl returns a logical vector indicating which elements match.
-> grepl("^New", state.name)
- [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
-[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
-[25] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
-[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
-[49] FALSE FALSE
-
-
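The three return forms shown above carry the same information in different shapes. A short sketch using the built-in `state.name` vector; the variable names (`idx`, `vals`, `hit`) are only illustrative:

```r
# state.name is a character vector that ships with base R.
idx  <- grep("^New", state.name)                 # integer positions
vals <- grep("^New", state.name, value = TRUE)   # the matching strings
hit  <- grepl("^New", state.name)                # one TRUE/FALSE per element

# Indexing with the positions gives the same strings as value = TRUE.
same_values <- identical(state.name[idx], vals)

# The positions of the TRUE entries are the same as the grep indices.
same_index <- all(which(hit) == idx)

# A logical vector is convenient for counting matches.
n_new <- sum(hit)
```

The logical form from `grepl` is usually the handiest one for subsetting, because it has the same length as the input even when nothing matches.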
- -
- -
- - -
-

regexpr

-
-
-

Some limitations of grep

- -
    -
  • The grep function tells you which strings in a character vector match a certain pattern but it doesn’t tell you exactly where the match occurs or what the match is (for a more complicated regex).
  • -
  • The regexpr function gives you the index into each string where the match begins and the length of the match for that string.
  • -
  • regexpr only gives you the first match of the string (reading left to right). gregexpr will give you all of the matches in a given string.
  • -
- -
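The difference between `regexpr` and `gregexpr` described in the list above can be seen on a single made-up string (the string `x` is an assumption chosen for illustration):

```r
x <- "found here, found there"    # made-up string with two matches

first <- regexpr("found", x)      # first match only
all_m <- gregexpr("found", x)     # every match, returned as a list

first_start <- as.integer(first)              # where the first match begins
first_len   <- attr(first, "match.length")    # how long that match is
all_starts  <- as.integer(all_m[[1]])         # start of each match in x
all_text    <- regmatches(x, all_m)[[1]]      # the matched substrings
```

`gregexpr` returns a list with one element per input string, which is why the sketch indexes it with `[[1]]`.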
- -
- - -
-

regexpr

-
-
-

How can we find the date of the homicide?

- -
> homicides[1]
-[1] "39.311024, -76.674227, iconHomicideShooting, ’p2’, ’<dl><dt>Leon
-Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore,
-MD 21216</dd><dd>black male, 17 years old</dd>
-<dd>Found on January 1, 2007</dd><dd>Victim died at Shock
-Trauma</dd><dd>Cause: shooting</dd></dl>’"
-
- -

Can we just ’grep’ on “Found”?

- -
- -
- - -
-

regexpr

-
-
-

The word ’found’ may be found elsewhere in the entry.

- -
> homicides[954]
-[1] "39.30677400000, -76.59891100000, icon_homicide_shooting, ’p816’,
-’<dl><dd class=\"address\">1400 N Caroline St<br />Baltimore, MD 21213</dd>
-<dd>Race: Black<br />Gender: male<br />Age: 29 years old</dd>
-<dd>Found on March  3, 2010</dd><dd>Victim died at Scene</dd>
-<dd>Cause: Shooting</dd><dd class=\"popup-note\"><p>Wheeler\\’s body
-was&nbsp;found on the grounds of Dr. Bernard Harris Sr.&nbsp;Elementary
-School</p></dd></dl>’"
-
- -
- -
- - -
-

regexpr

-
-
-

Let’s use the pattern '<dd>[F|f]ound(.*)</dd>'. What does this look for?

- -
> regexpr("<dd>[F|f]ound(.*)</dd>", homicides[1:10])
- [1] 177 178 188 189 178 182 178 187 182 183
-attr(,"match.length")
- [1] 93 86 89 90 89 84 85 84 88 84
-attr(,"useBytes")
-[1] TRUE
-> substr(homicides[1], 177, 177 + 93 - 1)
-[1] "<dd>Found on January 1, 2007</dd><dd>Victim died at Shock
- Trauma</dd><dd>Cause: shooting</dd>"
-
- -
- -
- - -
-

regexpr

-
-
-

The previous pattern was too greedy and matched too much of the string. We need to follow the * with the ? metacharacter to make the match “lazy”, so that it stops at the first closing </dd>.

- -
> regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1:10])
- [1] 177 178 188 189 178 182 178 187 182 183
-attr(,"match.length")
- [1] 33 33 33 33 33 33 33 33 33 33
-attr(,"useBytes")
-[1] TRUE
-> substr(homicides[1], 177, 177 + 33 - 1)
-[1] "<dd>Found on January 1, 2007</dd>"
-
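Greedy and lazy matching can be compared side by side on a short made-up fragment. This is only a sketch: the string `x` is hypothetical, and the pattern uses `[Ff]` rather than the slides' `[F|f]`, because inside a character class `|` is a literal character, so `[F|f]` would also accept a `|` in that position.

```r
# Made-up fragment in the same shape as the records in the slides.
x <- "<dd>Found on January 1, 2007</dd><dd>Cause: shooting</dd>"

# Greedy: (.*) runs on to the last </dd> in the string.
greedy <- regmatches(x, regexpr("<dd>[Ff]ound(.*)</dd>", x))

# Lazy: (.*?) stops at the first </dd> it can reach.
lazy <- regmatches(x, regexpr("<dd>[Ff]ound(.*?)</dd>", x))

# Inside [], the | is a literal, so [F|f] matches a "|" as well.
pipe_in_class <- grepl("[F|f]ound", "|ound")
pipe_excluded <- grepl("[Ff]ound", "|ound")
```

Here `greedy` is the whole of `x`, while `lazy` is the 33-character date element, the same length reported in `match.length` above.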
- -
- -
- - -
-

regmatches

-
-
-

One handy function is regmatches, which extracts the matched substrings for you so that you do not have to call substr yourself.

- -
> r <- regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1:5])
-> regmatches(homicides[1:5], r)
-[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>"
-[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>"
-[5] "<dd>Found on January 5, 2007</dd>"
-
- -
- -
- - -
-

sub/gsub

-
-
-

Sometimes we need to clean things up or modify strings by matching a pattern and replacing it with something else. For example, how can we extract the date from this string?

- -
> x <- substr(homicides[1], 177, 177 + 33 - 1) 
-> x
-[1] "<dd>Found on January 1, 2007</dd>"
-
- -

We want to strip out the stuff surrounding the “January 1, 2007” piece.

- -
> sub("<dd>[F|f]ound on |</dd>", "", x)
-[1] "January 1, 2007</dd>"
-> gsub("<dd>[F|f]ound on |</dd>", "", x)
-[1] "January 1, 2007"
-
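The two results above differ because the pattern has two alternatives and both occur in the string. A minimal sketch that reproduces this on a literal copy of the string (the names `pattern`, `once` and `every` are illustrative):

```r
x <- "<dd>Found on January 1, 2007</dd>"
pattern <- "<dd>[Ff]ound on |</dd>"

# sub replaces only the first match, so the closing tag survives.
once <- sub(pattern, "", x)

# gsub replaces every match, so both the prefix and the closing tag go.
every <- gsub(pattern, "", x)
```

When a pattern is built from alternatives like this, `gsub` is usually the one that is wanted.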
- -
- -
- - -
-

sub/gsub

-
-
-

sub/gsub can take vector arguments

- -
> r <- regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1:5])
-> m <- regmatches(homicides[1:5], r)
-> m
-[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>" 
-[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>" 
-[5] "<dd>Found on January 5, 2007</dd>"
-> (d <- gsub("<dd>[F|f]ound on |</dd>", "", m))
-[1] "January 1, 2007" "January 2, 2007" "January 2, 2007" "January 3, 2007"
-[5] "January 5, 2007"
-> as.Date(d, "%B %d, %Y")
-[1] "2007-01-01" "2007-01-02" "2007-01-02" "2007-01-03" "2007-01-05"
-
- -
- -
- - -
-

regexec

-
-
-

The regexec function works like regexpr except that it also gives you the indices for parenthesized sub-expressions.

- -
> regexec("<dd>[F|f]ound on (.*?)</dd>", homicides[1])
-[[1]]
-[1] 177 190
-attr(,"match.length")
-[1] 33 15
-
-> regexec("<dd>[F|f]ound on .*?</dd>", homicides[1])
-[[1]]
-[1] 177
-attr(,"match.length")
-[1] 33
-
- -
- -
- - -
-

regexec

-
-
-

Now we can extract the string in the parenthesized sub-expression.

- -
> regexec("<dd>[F|f]ound on (.*?)</dd>", homicides[1])
-[[1]]
-[1] 177 190
-attr(,"match.length")
-[1] 33 15
-
-> substr(homicides[1], 177, 177 + 33 - 1)
-[1] "<dd>Found on January 1, 2007</dd>"
-
-> substr(homicides[1], 190, 190 + 15 - 1)
-[1] "January 1, 2007"
-
- -
- -
- - -
-

regexec

-
-
-

This is even easier with the regmatches function.

- -
> r <- regexec("<dd>[F|f]ound on (.*?)</dd>", homicides[1:2])
-> regmatches(homicides[1:2], r)
-[[1]]
-[1] "<dd>Found on January 1, 2007</dd>" "January 1, 2007"
-
-[[2]]
-[1] "<dd>Found on January 2, 2007</dd>" "January 2, 2007"
-
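Each list element returned above holds the full match first and the parenthesized sub-expression second, so the captured date is element 2. A small sketch on made-up records (the vector `x` is hypothetical; `vapply` is used here as a stricter stand-in for `sapply`):

```r
# Made-up records in the same shape as the slides' data.
x <- c("<dd>Found on January 1, 2007</dd>",
       "<dd>Found on March  3, 2010</dd>")

r <- regexec("<dd>[Ff]ound on (.*?)</dd>", x)
m <- regmatches(x, r)

# Element 1 of each match is the whole match, element 2 is the capture.
# vapply insists on exactly one string per record.
found_on <- vapply(m, function(parts) parts[2], character(1))
```

A record with no match would give `NA` here rather than an error, because `regmatches` returns an empty character vector for it and indexing element 2 of that yields `NA`.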
- -
- -
- - -
-

regexec

-
-
-

Let’s make a plot of monthly homicide counts

- -
> r <- regexec("<dd>[F|f]ound on (.*?)</dd>", homicides)
-> m <- regmatches(homicides, r)
-> dates <- sapply(m, function(x) x[2])
-> dates <- as.Date(dates, "%B %d, %Y")
-> hist(dates, "month", freq = TRUE)
-
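The monthly counts behind such a histogram can also be tabulated directly. The sketch below assumes a few made-up dates written in ISO format, so that parsing does not depend on the locale; note that parsing month names with `%B`, as the slides do, relies on the session's locale using English month names.

```r
# Made-up dates; ISO format parses the same way in any locale.
dates <- as.Date(c("2007-01-01", "2007-01-02", "2007-01-05", "2007-02-14"))

# Label each date with its year and month, then count the labels.
monthly <- table(format(dates, "%Y-%m"))
```

Printing `monthly` gives one count per year-month label, which is a quick check on what `hist(dates, "month", freq = TRUE)` is drawing.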
- -
- -
- - -
-

regexec

-
-
-

- -
- -
- - -
-

Summary

-
-
-

The primary R functions for dealing with regular expressions are

- -
    -
  • grep, grepl: Search for matches of a regular expression/pattern in a character vector
  • -
  • regexpr, gregexpr: Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction with regmatches
  • -
  • sub, gsub: Search a character vector for regular expression matches and replace that match with another string
  • -
  • regexec: Gives you indices of parenthesized sub-expressions.
  • -
- -
- -
- - -
- - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/02_RProgramming/NOTUSED/grep/index.md b/02_RProgramming/NOTUSED/grep/index.md deleted file mode 100644 index a4da4e1f..00000000 --- a/02_RProgramming/NOTUSED/grep/index.md +++ /dev/null @@ -1,337 +0,0 @@ ---- -title : Regular Expressions - grep -subtitle : Computing for Data Analysis -author : Roger Peng, Associate Professor -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../libraries - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - -## Regular Expression Functions - -The primary R functions for dealing with regular expressions are -- `grep`, `grepl`: Search for matches of a regular expression/pattern in a character vector; either return the indices into the character vector that match, the strings that happen to match, or a TRUE/FALSE vector indicating which elements match -- `regexpr`, `gregexpr`: Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match -- `sub`, `gsub`: Search a character vector for regular expression matches and replace that match with another string -- `regexec`: Easier to explain through demonstration. - ---- - -## grep - -Here is an excerpt of the Baltimore City homicides dataset: - -```r -> homicides <- readLines("homicides.txt") -> homicides[1] -[1] "39.311024, -76.674227, iconHomicideShooting, ’p2’, ’
Leon -Nelson
3400 Clifton Ave.
Baltimore, MD -21216
black male, 17 years old
-
Found on January 1, 2007
Victim died at Shock -Trauma
Cause: shooting
’" - -> homicides[1000] -[1] "39.33626300000, -76.55553990000, icon_homicide_shooting, ’p1200’,... -``` - -How can I find the records for all the victims of shootings (as opposed to other causes)? - ---- - -## grep - -```r -> length(grep("iconHomicideShooting", homicides)) -[1] 228 -> length(grep("iconHomicideShooting|icon_homicide_shooting", homicides)) -[1] 1003 -> length(grep("Cause: shooting", homicides)) -[1] 228 -> length(grep("Cause: [Ss]hooting", homicides)) -[1] 1003 -> length(grep("[Ss]hooting", homicides)) -[1] 1005 -``` - ---- - -## grep - -```r -> i <- grep("[cC]ause: [Ss]hooting", homicides) -> j <- grep("[Ss]hooting", homicides) -> str(i) - int [1:1003] 1 2 6 7 8 9 10 11 12 13 ... -> str(j) - int [1:1005] 1 2 6 7 8 9 10 11 12 13 ... -> setdiff(i, j) -integer(0) -> setdiff(j, i) -[1] 318 859 -``` - ---- - -## grep - -```r -> homicides[859] -[1] "39.33743900000, -76.66316500000, icon_homicide_bluntforce, -’p914’, ’
Steven Harris -
4200 Pimlico Road
Baltimore, MD 21215 -
Race: Black
Gender: male
Age: 38 years old
-
Found on July 29, 2010
Victim died at Scene
-
Cause: Blunt Force

Harris was -found dead July 22 and ruled a shooting victim; an autopsy -subsequently showed that he had not been shot,...

’" -``` - ---- - -## grep - -By default, `grep` returns the indices into the character vector where the regex pattern matches. - -```r -> grep("^New", state.name) -[1] 29 30 31 32 -Setting value = TRUE returns the actual elements of the character vector that match. > grep("^New", state.name, value = TRUE) -[1] "New Hampshire" "New Jersey" "New Mexico" "New York" -grepl returns a logical vector indicating which element matches. -> grepl("^New", state.name) - [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALS -[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALS -[25] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALS -[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALS -[49] FALSE FALSE - -``` - ---- - -## regexpr - -Some limitations of `grep` -- The `grep` function tells you which strings in a character vector match a certain pattern but it doesn’t tell you exactly where the match occurs or what the match is (for a more complicated regex). -- The `regexpr` function gives you the index into each string where the match begins and the length of the match for that string. -- `regexpr` only gives you the first match of the string (reading left to right). `gregexpr` will give you all of the matches in a given string. - ---- - -## regexpr - -How can we find the date of the homicide? - -```r -> homicides[1] -[1] "39.311024, -76.674227, iconHomicideShooting, ’p2’, ’
Leon -Nelson
3400 Clifton Ave.
Baltimore, -MD 21216
black male, 17 years old
-
Found on January 1, 2007
Victim died at Shock -Trauma
Cause: shooting
’" -``` - -Can we just ’grep’ on “Found”? - ---- - -## regexpr - -The word ’found’ may be found elsewhere in the entry. - -```r -> homicides[954] -[1] "39.30677400000, -76.59891100000, icon_homicide_shooting, ’p816’, -’
1400 N Caroline St
Baltimore, MD 21213
-
Race: Black
Gender: male
Age: 29 years old
-
Found on March 3, 2010
Victim died at Scene
-
Cause: Shooting

Wheeler\\’s body -was found on the grounds of Dr. Bernard Harris Sr. Elementary -School

’" -``` - ---- - -## regexpr - -Let’s use the pattern -'
[F|f]ound(.*)
' -What does this look for? - -```r -> regexpr("
[F|f]ound(.*)
", homicides[1:10]) - [1] 177 178 188 189 178 182 178 187 182 183 -attr(,"match.length") - [1] 93 86 89 90 89 84 85 84 88 84 -attr(,"useBytes") -[1] TRUE -> substr(homicides[1], 177, 177 + 93 - 1) -[1] "
Found on January 1, 2007
Victim died at Shock - Trauma
Cause: shooting
" -``` - ---- - -## regexpr - -The previous pattern was too greedy and matched too much of the string. We need to use the ? metacharacter to make the regex “lazy”. - -```r -> regexpr("
[F|f]ound(.*?)
", homicides[1:10]) - [1] 177 178 188 189 178 182 178 187 182 183 -attr(,"match.length") - [1] 33 33 33 33 33 33 33 33 33 33 -attr(,"useBytes") -[1] TRUE -> substr(homicides[1], 177, 177 + 33 - 1) -[1] "
Found on January 1, 2007
" -``` - ---- - -## regmatches - -One handy function is regmatches which extracts the matches in the strings for you without you having to use `substr`. - -```r -> r <- regexpr("
[F|f]ound(.*?)
", homicides[1:5]) -> regmatches(homicides[1:5], r) -[1] "
Found on January 1, 2007
" "
Found on January 2, 2007
" -[3] "
Found on January 2, 2007
" "
Found on January 3, 2007
" -[5] "
Found on January 5, 2007
" -``` - ---- - -## sub/gsub - -Sometimes we need to clean things up or modify strings by matching a pattern and replacing it with something else. For example, how can we extract the data from this string? - -```r -> x <- substr(homicides[1], 177, 177 + 33 - 1) -> x -[1] "
Found on January 1, 2007
" -``` - -We want to strip out the stuff surrounding the “January 1, 2007” piece. - -```r -> sub("
[F|f]ound on |
", "", x) -[1] "January 1, 2007" -> gsub("
[F|f]ound on |
", "", x) -[1] "January 1, 2007" -``` - ---- - -## sub/gsub - -sub/gsub can take vector arguments - -```r -> r <- regexpr("
[F|f]ound(.*?)
", homicides[1:5]) -> m <- regmatches(homicides[1:5], r) ->m -[1] "
Found on January 1, 2007
" "
Found on January 2, 2007
" -[3] "
Found on January 2, 2007
" "
Found on January 3, 2007
" -[5] "
Found on January 5, 2007
" -> gsub("
[F|f]ound on |
", "", m) -[1] "January 1, 2007" "January 2, 2007" "January 2, 2007" "January 3, 2007" -[5] "January 5, 2007" -> as.Date(d, "%B %d, %Y") -[1] "2007-01-01" "2007-01-02" "2007-01-02" "2007-01-03" "2007-01-05" -``` - ---- - -## regexec - -The `regexec` function works like regexpr except it gives you the indices for parenthesized sub-expressions. - -```r -> regexec("
[F|f]ound on (.*?)
", homicides[1]) -[[1]] -[1] 177 190 -attr(,"match.length") -[1] 33 15 - -> regexec("
[F|f]ound on .*?
", homicides[1]) -[[1]] -[1] 177 -attr(,"match.length") -[1] 33 -``` - ---- - -## regexec - -Now we can extract the string in the parenthesized sub-expression. - -```r -> regexec("
[F|f]ound on (.*?)
", homicides[1]) -[[1]] -[1] 177 190 -attr(,"match.length") -[1] 33 15 - -> substr(homicides[1], 177, 177 + 33 - 1) -[1] "
Found on January 1, 2007
" - -> substr(homicides[1], 190, 190 + 15 - 1) -[1] "January 1, 2007" -``` - ---- - -## regexec - -Even easier with the regmatches function. - -```r -> r <- regexec("
[F|f]ound on (.*?)
", homicides[1:2]) -> regmatches(homicides[1:2], r) -[[1]] -[1] "
Found on January 1, 2007
" "January 1, 2007" - -[[2]] -[1] "
Found on January 2, 2007
" "January 2, 2007" -``` - ---- - -## regexec - -Let’s make a plot of monthly homicide counts - -```r -> r <- regexec("
[F|f]ound on (.*?)
", homicides) -> m <- regmatches(homicides, r) -> dates <- sapply(m, function(x) x[2]) -> dates <- as.Date(dates, "%B %d, %Y") -> hist(dates, "month", freq = TRUE) -``` - ---- - -## regexec - - - ---- - -## Summary - -The primary R functions for dealing with regular expressions are -- `grep`, `grepl`: Search for matches of a regular expression/pattern in a character vector -- `regexpr`, `gregexpr`: Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction with `regmatches` -- `sub`, `gsub`: Search a character vector for regular expression matches and replace that match with another string -- `regexec`: Gives you indices of parethensized sub-expressions. diff --git a/02_RProgramming/NOTUSED/regex/Regular Expressions.pdf b/02_RProgramming/NOTUSED/regex/Regular Expressions.pdf deleted file mode 100644 index 3f9d25eb..00000000 Binary files a/02_RProgramming/NOTUSED/regex/Regular Expressions.pdf and /dev/null differ diff --git a/02_RProgramming/NOTUSED/regex/index.Rmd b/02_RProgramming/NOTUSED/regex/index.Rmd deleted file mode 100644 index f563b995..00000000 --- a/02_RProgramming/NOTUSED/regex/index.Rmd +++ /dev/null @@ -1,475 +0,0 @@ ---- -title : Regular Expressions -subtitle : Computing for Data Analysis -author : Roger Peng, Associate Professor -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../libraries - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - -## Regular expressions - -- Regular expressions can be thought of as a combination of literals and _metacharacters_ -- To draw an analogy with natural language, think of literal text forming the words of this language, and the metacharacters defining its grammar -- Regular 
expressions have a rich set of metacharacters - ---- - -## Literals - -Simplest pattern consists only of literals. The literal “nuclear” would match to the following lines: - -```markdown -Ooh. I just learned that to keep myself alive after a -nuclear blast! All I have to do is milk some rats -then drink the milk. Aweosme. :} - -Laozi says nuclear weapons are mas macho - -Chaos in a country that has nuclear weapons -- not good. - -my nephew is trying to teach me nuclear physics, or -possibly just trying to show me how smart he is -so I’ll be proud of him [which I am]. - -lol if you ever say "nuclear" people immediately think -DEATH by radiation LOL -``` - ---- - -## Literals - -The literal “Obama” would match to the following lines - -```markdown -Politics r dum. Not 2 long ago Clinton was sayin Obama -was crap n now she sez vote 4 him n unite? WTF? -Screw em both + Mcain. Go Ron Paul! - -Clinton conceeds to Obama but will her followers listen?? - -Are we sure Chelsea didn’t vote for Obama? - -thinking ... Michelle Obama is terrific! - -jetlag..no sleep...early mornig to starbux..Ms. Obama -was moving -``` - ---- - -## Regular Expressions - -- Simplest pattern consists only of literals; a match occurs if the sequence of literals occurs anywhere in the text being tested - -- What if we only want the word “Obama”? or sentences that end in the word “Clinton”, or “clinton” or “clinto”? - ---- - -## Regular Expressions - -We need a way to express -- whitespace word boundaries -- sets of literals -- the beginning and end of a line -- alternatives (“war” or “peace”) -Metacharacters to the rescue! - ---- - -## Metacharacters - -Some metacharacters represent the start of a line - -```markdown -^i think -``` - -will match the lines - -```markdown -i think we all rule for participating -i think i have been outed -i think this will be quite fun actually -i think i need to go to work -i think i first saw zombo in 1999. 
-``` - ---- - -## Metacharacters - -$ represents the end of a line - -```markdown -morning$ -``` - -will match the lines - -```markdown -well they had something this morning -then had to catch a tram home in the morning -dog obedience school in the morning -and yes happy birthday i forgot to say it earlier this morning -I walked in the rain this morning -good morning -``` - ---- - -## Character Classes with [] - -We can list a set of characters we will accept at a given point in the match - -```markdown -[Bb][Uu][Ss][Hh] -``` - -will match the lines - -```markdown -The democrats are playing, "Name the worst thing about Bush!" -I smelled the desert creosote bush, brownies, BBQ chicken -BBQ and bushwalking at Molonglo Gorge -Bush TOLD you that North Korea is part of the Axis of Evil -I’m listening to Bush - Hurricane (Album Version) -``` - ---- - -## Character Classes with [] - -```markdown -^[Ii] am -``` - -will match - -```markdown -i am so angry at my boyfriend i can’t even bear to -look at him - -i am boycotting the apple store - -I am twittering from iPhone - -I am a very vengeful person when you ruin my sweetheart. - -I am so over this. I need food. Mmmm bacon... -``` - ---- - -## Character Classes with [] - -Similarly, you can specify a range of letters [a-z] or [a-zA-Z]; notice that the order doesn’t matter - -```markdown -^[0-9][a-zA-Z] -``` - -will match the lines - -```markdown -7th inning stretch -2nd half soon to begin. OSU did just win something -3am - cant sleep - too hot still.. :( -5ft 7 sent from heaven -1st sign of starvagtion -``` - ---- - -## Character Classes with [] - -When used at the beginning of a character class, the “^” is also a metacharacter and indicates matching characters NOT in the indicated class - -```markdown -[^?.]$ -``` - -will match the lines - -```markdown -i like basketballs -6 and 9 -dont worry... we all die anyway! -Not in Baghdad -helicopter under water? 
hmmm -``` - ---- - -## More Metacharacters - -“.” is used to refer to any character. So - -```markdown -9.11 -``` - -will match the lines - -```markdown -its stupid the post 9-11 rules -if any 1 of us did 9/11 we would have been caught in days. -NetBios: scanning ip 203.169.114.66 -Front Door 9:11:46 AM -Sings: 0118999881999119725...3 ! -``` - ---- - -## More Metacharacters: | - -This does not mean “pipe” in the context of regular expressions; instead it translates to “or”; we can use it to combine two expressions, the subexpressions being called alternatives - -```markdown -flood|fire -``` - -will match the lines - -```markdown -is firewire like usb on none macs? -the global flood makes sense within the context of the bible -yeah ive had the fire on tonight -... and the floods, hurricanes, killer heatwaves, rednecks, gun nuts, etc. - -``` - ---- - -## More Metacharacters: | - -We can include any number of alternatives... - -```markdown -flood|earthquake|hurricane|coldfire -``` - -will match the lines - -```markdown -Not a whole lot of hurricanes in the Arctic. -We do have earthquakes nearly every day somewhere in our State -hurricanes swirl in the other direction -coldfire is STRAIGHT! -’cause we keep getting earthquakes -``` - ---- - -## More Metacharacters: | - -The alternatives can be real expressions and not just literals - -```markdown -^[Gg]ood|[Bb]ad -``` - -will match the lines - -```markdown -good to hear some good knews from someone here -Good afternoon fellow american infidels! -good on you-what do you drive? -Katie... guess they had bad experiences... 
-my middle name is trouble, Miss Bad News -``` - ---- - -## More Metacharacters: ( and ) - -Subexpressions are often contained in parentheses to constrain the alternatives - -```markdown -^([Gg]ood|[Bb]ad) -``` - -will match the lines - -```markdown -bad habbit -bad coordination today -good, becuase there is nothing worse than a man in kinky underwear -Badcop, its because people want to use drugs -Good Monday Holiday -Good riddance to Limey -``` - ---- - -## More Metacharacters: ? - -The question mark indicates that the indicated expression is optional - -```markdown -[Gg]eorge( [Ww]\.)? [Bb]ush -``` - -will match the lines - -```markdown -i bet i can spell better than you and george bush combined -BBC reported that President George W. Bush claimed God told him to invade I -a bird in the hand is worth two george bushes -``` - ---- - -## One thing to note... - -In the following - -```markdown -[Gg]eorge( [Ww]\.)? [Bb]ush -``` - -we wanted to match a “.” as a literal period; to do that, we had to “escape” the metacharacter, preceding it with a backslash In general, we have to do this for any metacharacter we want to include in our match - ---- - -## More metacharacters: * and + - -The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item” - -```markdown -(.*) -``` - -will match the lines - -```markdown -anyone wanna chat? (24, m, germany) -hello, 20.m here... 
( east area + drives + webcam ) -(he means older men) -() -``` - ---- - -## More metacharacters: * and + - -The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item” - -```markdown -[0-9]+ (.*)[0-9]+ -``` - -will match the lines - -```markdown -working as MP here 720 MP battallion, 42nd birgade -so say 2 or 3 years at colleage and 4 at uni makes us 23 when and if we fin -it went down on several occasions for like, 3 or 4 *days* -Mmmm its time 4 me 2 go 2 bed -``` - ---- - -## More metacharacters: { and } - -{ and } are referred to as interval quantifiers; the let us specify the minimum and maximum number of matches of an expression - -```markdown -[Bb]ush( +[^ ]+ +){1,5} debate -``` - -will match the lines - -```markdown -Bush has historically won all major debates he’s done. -in my view, Bush doesn’t need these debates.. -bush doesn’t need the debates? maybe you are right -That’s what Bush supporters are doing about the debate. -Felix, I don’t disagree that Bush was poorly prepared for the debate. -indeed, but still, Bush should have taken the debate more seriously. -Keep repeating that Bush smirked and scowled during the debate -``` - ---- - -## More metacharacters: and - -- m,n means at least m but not more than n matches -- m means exactly m matches -- m, means at least m matches - ---- - -## More metacharacters: ( and ) revisited - -- In most implementations of regular expressions, the parentheses not only limit the scope of alternatives divided by a “|”, but also can be used to “remember” text matched by the subexpression enclosed -- We refer to the matched text with \1, \2, etc. - ---- - -## More metacharacters: ( and ) revisited - -So the expression - -```markdown -+([a-zA-Z]+) +\1 + -``` - -will match the lines - -```markdown -time for bed, night night twitter! 
-blah blah blah blah -my tattoo is so so itchy today -i was standing all all alone against the world outside... -hi anybody anybody at home -estudiando css css css css.... que desastritooooo -``` - ---- - -## More metacharacters: ( and ) revisited - -The * is “greedy” so it always matches the _longest_ possible string that satisfies the regular expression. So - -```markdown -^s(.*)s -``` - -matches - -```markdown -sitting at starbucks -setting up mysql and rails -studying stuff for the exams -spaghetti with marshmallows -stop fighting with crackers -sore shoulders, stupid ergonomics -``` - ---- - -## More metacharacters: ( and ) revisited - -The greediness of * can be turned off with the ?, as in - -```markdown -^s(.*?)s$ -``` - ---- - -## Summary - -- Regular expressions are used in many different languages; not unique to R. -- Regular expressions are composed of literals and metacharacters that represent sets or classes of characters/words -- Text processing via regular expressions is a very powerful way to extract data from “unfriendly” sources (not all data comes as a CSV file) -(Thanks to Mark Hansen for some material in this lecture.) \ No newline at end of file diff --git a/02_RProgramming/NOTUSED/regex/index.html b/02_RProgramming/NOTUSED/regex/index.html deleted file mode 100644 index b905e3a7..00000000 --- a/02_RProgramming/NOTUSED/regex/index.html +++ /dev/null @@ -1,649 +0,0 @@ - - - - Regular Expressions - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Regular Expressions

-

Computing for Data Analysis

-

Roger Peng, Associate Professor
Johns Hopkins Bloomberg School of Public Health

-
-
- - - -
-

Regular expressions

-
-
-
    -
  • Regular expressions can be thought of as a combination of literals and metacharacters
  • -
  • To draw an analogy with natural language, think of literal text forming the words of this language, and the metacharacters defining its grammar
  • -
  • Regular expressions have a rich set of metacharacters
  • -
- -
- -
- - -
-

Literals

-
-
-

The simplest pattern consists only of literals. The literal “nuclear” would match the following lines:

- -
Ooh. I just learned that to keep myself alive after a
-nuclear blast! All I have to do is milk some rats
-then drink the milk. Aweosme. :}
-
-Laozi says nuclear weapons are mas macho
-
-Chaos in a country that has nuclear weapons -- not good.
-
-my nephew is trying to teach me nuclear physics, or
-possibly just trying to show me how smart he is
-so I’ll be proud of him [which I am].
-
-lol if you ever say "nuclear" people immediately think
-DEATH by radiation LOL
-
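A minimal sketch of literal matching in R, using the base functions `grepl` and `grep`. The `tweets` vector is a hypothetical sample adapted from the lines above, not the lecture's dataset.

```r
# Hypothetical sample lines (adapted from the examples above)
tweets <- c(
  "Laozi says nuclear weapons are mas macho",
  "Chaos in a country that has nuclear weapons -- not good.",
  "i think i need to go to work"
)

# grepl() returns one TRUE/FALSE per element of the input
has_nuclear <- grepl("nuclear", tweets)

# grep(value = TRUE) returns the matching strings themselves
nuclear_lines <- grep("nuclear", tweets, value = TRUE)

print(has_nuclear)
print(nuclear_lines)
```

A literal matches anywhere in the string, so no anchors or metacharacters are needed here.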
- -
- -
- - -
-

Literals

-
-
-

The literal “Obama” would match the following lines:

- -
Politics r dum. Not 2 long ago Clinton was sayin Obama
-was crap n now she sez vote 4 him n unite? WTF?
-Screw em both + Mcain. Go Ron Paul!
-
-Clinton conceeds to Obama but will her followers listen??
-
-Are we sure Chelsea didn’t vote for Obama?
-
-thinking ... Michelle Obama is terrific!
-
-jetlag..no sleep...early mornig to starbux..Ms. Obama
-was moving
-
- -
- -
- - -
-

Regular Expressions

-
-
-
    -
  • The simplest pattern consists only of literals; a match occurs if the sequence of literals occurs anywhere in the text being tested

  • -
  • What if we only want the word “Obama”, or sentences that end in the word “Clinton”, “clinton”, or “clinto”?

  • -
- -
- -
- - -
-

Regular Expressions

-
-
-

We need a way to express

- -
    -
  • whitespace word boundaries
  • -
  • sets of literals
  • -
  • the beginning and end of a line
  • -
  • alternatives (“war” or “peace”)
-Metacharacters to the rescue!
  • -
- -
- -
- - -
-

Metacharacters

-
-
-

The ^ metacharacter represents the start of a line

- -
^i think
-
- -

will match the lines

- -
i think we all rule for participating
-i think i have been outed
-i think this will be quite fun actually
-i think i need to go to work
-i think i first saw zombo in 1999.
-
- -
- -
- - -
-

Metacharacters

-
-
-

$ represents the end of a line

- -
morning$
-
- -

will match the lines

- -
well they had something this morning
-then had to catch a tram home in the morning
-dog obedience school in the morning
-and yes happy birthday i forgot to say it earlier this morning
-I walked in the rain this morning
-good morning
-
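The two anchors can be tried side by side in R. This is a small sketch under the assumption that each element of the character vector is treated as one line; `anchor_lines` is a hypothetical sample.

```r
# Hypothetical sample lines
anchor_lines <- c(
  "i think i need to go to work",
  "well, i think so",
  "good morning",
  "morning has broken"
)

# ^ anchors the pattern at the start of each string
starts_with_i_think <- grepl("^i think", anchor_lines)

# $ anchors the pattern at the end of each string
ends_with_morning <- grepl("morning$", anchor_lines)

print(starts_with_i_think)
print(ends_with_morning)
```

Without the anchors, both patterns would also match the lines where the text appears in the middle.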
- -
- -
- - -
-

Character Classes with []

-
-
-

We can list a set of characters we will accept at a given point in the match

- -
[Bb][Uu][Ss][Hh]
-
- -

will match the lines

- -
The democrats are playing, "Name the worst thing about Bush!"
-I smelled the desert creosote bush, brownies, BBQ chicken
-BBQ and bushwalking at Molonglo Gorge
-Bush TOLD you that North Korea is part of the Axis of Evil
-I’m listening to Bush - Hurricane (Album Version)
-
- -
- -
- - -
-

Character Classes with []

-
-
-
^[Ii] am
-
- -

will match

- -
i am so angry at my boyfriend i can’t even bear to
-look at him
-
-i am boycotting the apple store
-
-I am twittering from iPhone
-
-I am a very vengeful person when you ruin my sweetheart.
-
-I am so over this. I need food. Mmmm bacon...
-
- -
- -
- - -
-

Character Classes with []

-
-
-

Similarly, you can specify a range of letters [a-z] or [a-zA-Z]; notice that the order of the ranges inside the brackets doesn’t matter

- -
^[0-9][a-zA-Z]
-
- -

will match the lines

- -
7th inning stretch
-2nd half soon to begin. OSU did just win something
-3am - cant sleep - too hot still.. :(
-5ft 7 sent from heaven
-1st sign of starvagtion
-
- -
- -
- - -
-

Character Classes with []

-
-
-

When used at the beginning of a character class, the “^” is also a metacharacter and indicates matching characters NOT in the indicated class

- -
[^?.]$
-
- -

will match the lines

- -
i like basketballs
-6 and 9
-dont worry... we all die anyway!
-Not in Baghdad
-helicopter under water? hmmm
-
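The character-class patterns from the last few slides can be checked in R as follows. This is an illustrative sketch; `cc_lines` is a hypothetical sample built from the example lines.

```r
# Hypothetical sample lines adapted from the slides above
cc_lines <- c(
  "Bush TOLD you that North Korea is part of the Axis of Evil",
  "BBQ and bushwalking at Molonglo Gorge",
  "7th inning stretch",
  "helicopter under water?"
)

# Either case is accepted at each of the four positions
any_case_bush <- grepl("[Bb][Uu][Ss][Hh]", cc_lines)

# ignore.case = TRUE is a shorter way to get the same result here
same_result <- grepl("bush", cc_lines, ignore.case = TRUE)

# A digit at the start of the line, followed by a letter
digit_then_letter <- grepl("^[0-9][a-zA-Z]", cc_lines)

# The last character is anything except "?" or "."
no_final_punct <- grepl("[^?.]$", cc_lines)

print(any_case_bush)
print(digit_then_letter)
print(no_final_punct)
```

Note that inside the brackets “?” and “.” are ordinary characters, so they need no escaping there.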
- -
- -
- - -
-

More Metacharacters

-
-
-

“.” is used to refer to any character. So

- -
9.11
-
- -

will match the lines

- -
its stupid the post 9-11 rules
-if any 1 of us did 9/11 we would have been caught in days.
-NetBios: scanning ip 203.169.114.66
-Front Door 9:11:46 AM
-Sings: 0118999881999119725...3 !
-
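A short R sketch of the “.” metacharacter; `dot_lines` is a hypothetical sample. The last element shows that “.” must consume exactly one character.

```r
# Hypothetical sample lines
dot_lines <- c(
  "the post 9-11 rules",
  "we did 9/11",
  "Front Door 9:11:46 AM",
  "call 911 now"
)

# "." stands for any single character between the 9 and the 11
dot_match <- grepl("9.11", dot_lines)

print(dot_match)
```

“911” does not match because there is no character between the 9 and the 11 for the “.” to consume.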
- -
- -
- - -
-

More Metacharacters: |

-
-
-

This does not mean “pipe” in the context of regular expressions; instead it translates to “or”; we can use it to combine two expressions, the subexpressions being called alternatives

- -
flood|fire
-
- -

will match the lines

- -
is firewire like usb on none macs?
-the global flood makes sense within the context of the bible
-yeah ive had the fire on tonight
-... and the floods, hurricanes, killer heatwaves, rednecks, gun nuts, etc.
-
-
- -
- -
- - -
-

More Metacharacters: |

-
-
-

We can include any number of alternatives...

- -
flood|earthquake|hurricane|coldfire
-
- -

will match the lines

- -
Not a whole lot of hurricanes in the Arctic.
-We do have earthquakes nearly every day somewhere in our State
-hurricanes swirl in the other direction
-coldfire is STRAIGHT!
-’cause we keep getting earthquakes
-
- -
- -
- - -
-

More Metacharacters: |

-
-
-

The alternatives can be real expressions and not just literals

- -
^[Gg]ood|[Bb]ad
-
- -

will match the lines

- -
good to hear some good knews from someone here
-Good afternoon fellow american infidels!
-good on you-what do you drive?
-Katie... guess they had bad experiences...
-my middle name is trouble, Miss Bad News
-
- -
- -
- - -
-

More Metacharacters: ( and )

-
-
-

Subexpressions are often contained in parentheses to constrain the alternatives

- -
^([Gg]ood|[Bb]ad)
-
- -

will match the lines

- -
bad habbit
-bad coordination today
-good, becuase there is nothing worse than a man in kinky underwear
-Badcop, its because people want to use drugs
-Good Monday Holiday
-Good riddance to Limey
-
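The difference between the last two patterns is easy to miss, so here is a small R comparison. It is a sketch on hypothetical sample lines (`alt_lines`): without parentheses the `^` binds only to the first alternative.

```r
# Hypothetical sample lines adapted from the slides above
alt_lines <- c(
  "good to hear some good knews from someone here",
  "Katie... guess they had bad experiences...",
  "bad habbit",
  "is firewire like usb on none macs?"
)

# "^" applies only to [Gg]ood; [Bb]ad may appear anywhere in the line
unconstrained <- grepl("^[Gg]ood|[Bb]ad", alt_lines)

# The parentheses make "^" apply to both alternatives
constrained <- grepl("^([Gg]ood|[Bb]ad)", alt_lines)

print(unconstrained)
print(constrained)
```

Only the second line changes: it contains “bad” in the middle, so it matches the first pattern but not the second.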
- -
- -
- - -
-

More Metacharacters: ?

-
-
-

The question mark indicates that the preceding expression is optional

- -
[Gg]eorge( [Ww]\.)? [Bb]ush
-
- -

will match the lines

- -
i bet i can spell better than you and george bush combined
-BBC reported that President George W. Bush claimed God told him to invade I
-a bird in the hand is worth two george bushes
-
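The optional group can be tried in R like this. It is a sketch on hypothetical sample lines (`opt_lines`); note that the backslash is doubled because it sits inside an R string literal.

```r
# Hypothetical sample lines
opt_lines <- c(
  "better than you and george bush combined",
  "President George W. Bush claimed",
  "George Walker Bush"
)

# ( [Ww]\.)? makes the middle initial optional
opt_pattern <- "[Gg]eorge( [Ww]\\.)? [Bb]ush"
opt_match <- grepl(opt_pattern, opt_lines)

print(opt_match)
```

The third line fails because “ Walker” is neither absent nor the literal “ W.” that the optional group allows.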
- -
- -
- - -
-

One thing to note...

-
-
-

In the following

- -
[Gg]eorge( [Ww]\.)? [Bb]ush
-
- -

we wanted to match a “.” as a literal period; to do that, we had to “escape” the metacharacter, preceding it with a backslash. In general, we have to do this for any metacharacter we want to include in our match
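In R the escape has one extra wrinkle: the pattern is written as a string literal, so the backslash itself has to be doubled. The sketch below shows three equivalent ways to match a literal period; `esc_lines` is a hypothetical sample.

```r
# Hypothetical sample lines
esc_lines <- c(
  "NetBios: scanning ip 203.169.114.66",
  "its stupid the post 9-11 rules"
)

# "\\." in the R string is the regex \. (a literal period)
escaped <- grepl("9\\.11", esc_lines)

# A one-character class also takes away the special meaning of "."
bracketed <- grepl("9[.]11", esc_lines)

# fixed = TRUE treats the whole pattern as a literal string, not a regex
literal <- grepl("9.11", esc_lines, fixed = TRUE)

print(escaped)
```

All three return the same result here; which one reads best is a matter of taste.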

- -
- -
- - -
-

More metacharacters: * and +

-
-
-

The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item”

- -
\(.*\)
-
- -

will match the lines

- -
anyone wanna chat? (24, m, germany)
-hello, 20.m here... ( east area + drives + webcam )
-(he means older men)
-()
-
- -
- -
- - -
-

More metacharacters: * and +

-
-
-

The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item”

- -
[0-9]+ (.*)[0-9]+
-
- -

will match the lines

- -
working as MP here 720 MP battallion, 42nd birgade
-so say 2 or 3 years at colleage and 4 at uni makes us 23 when and if we fin
-it went down on several occasions for like, 3 or 4 *days*
-Mmmm its time 4 me 2 go 2 bed
-
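A brief R sketch of the repetition operators, on hypothetical sample lines (`rep_lines`). It also illustrates one point worth hedging: unescaped parentheses act as grouping, so a pattern meant to find literal parentheses needs them escaped.

```r
# * allows zero repetitions, + requires at least one
star_match <- grepl("ab*c", c("ac", "abc", "abbc"))
plus_match <- grepl("ab+c", c("ac", "abc", "abbc"))

# Hypothetical sample lines adapted from the slides above
rep_lines <- c(
  "anyone wanna chat? (24, m, germany)",
  "()",
  "no parentheses here",
  "Mmmm its time 4 me 2 go 2 bed"
)

# Unescaped, the parentheses only group ".*", which matches any string
grouping_only <- grepl("(.*)", rep_lines)

# Escaped, they match literal "(" and ")" with anything in between
literal_parens <- grepl("\\(.*\\)", rep_lines)

# A number, a space, anything, then another number
two_numbers <- grepl("[0-9]+ (.*)[0-9]+", rep_lines)

print(literal_parens)
print(two_numbers)
```

In the first sample line the digits are followed by a comma rather than a space, so the last pattern does not match it.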
- -
- -
- - -
-

More metacharacters: { and }

-
-
-

{ and } are referred to as interval quantifiers; they let us specify the minimum and maximum number of matches of an expression

- -
[Bb]ush( +[^ ]+){1,5} +debate
-
- -

will match the lines

- -
Bush has historically won all major debates he’s done.
-in my view, Bush doesn’t need these debates..
-bush doesn’t need the debates? maybe you are right
-That’s what Bush supporters are doing about the debate.
-Felix, I don’t disagree that Bush was poorly prepared for the debate.
-indeed, but still, Bush should have taken the debate more seriously.
-Keep repeating that Bush smirked and scowled during the debate
-
- -
- -
- - -
-

More metacharacters: { and }

-
-
-
    -
  • {m,n} means at least m but not more than n matches
  • -
  • {m} means exactly m matches
  • -
  • {m,} means at least m matches
  • -
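The three interval forms can be checked in R on a toy input before applying them to text. The second half of the sketch uses a variant of the Bush/debate pattern in which each repeated group carries only its leading spaces; that variant is an assumption about the intent (one to five words between the two terms), chosen so that words separated by a single space can match. `debate_lines` is a hypothetical sample.

```r
a_runs <- c("a", "aa", "aaa", "aaaa")

# {2,3}: at least 2 but not more than 3
between_2_and_3 <- grepl("^a{2,3}$", a_runs)

# {2}: exactly 2
exactly_2 <- grepl("^a{2}$", a_runs)

# {2,}: at least 2
at_least_2 <- grepl("^a{2,}$", a_runs)

# Variant pattern (assumption): 1 to 5 space-separated words between the terms
debate_pattern <- "[Bb]ush( +[^ ]+){1,5} +debate"

# Hypothetical sample lines
debate_lines <- c(
  "Bush should have taken the debate more seriously",  # 4 words between
  "Bush debate",                                       # 0 words between
  "Bush one two three four five six debate"            # 6 words between
)
debate_match <- grepl(debate_pattern, debate_lines)

print(between_2_and_3)
print(debate_match)
```

If a group both starts and ends with required spaces, two consecutive groups need at least two spaces between words, which is why the variant keeps the spaces on one side only.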
- -
- -
- - -
-

More metacharacters: ( and ) revisited

-
-
-
    -
  • In most implementations of regular expressions, the parentheses not only limit the scope of alternatives divided by a “|”, but also can be used to “remember” text matched by the subexpression enclosed
  • -
  • We refer to the matched text with \1, \2, etc.
  • -
- -
- -
- - -
-

More metacharacters: ( and ) revisited

-
-
-

So the expression

- -
+([a-zA-Z]+) +\1 +
-
- -

will match the lines

- -
time for bed, night night twitter!
-blah blah blah blah
-my tattoo is so so itchy today
-i was standing all all alone against the world outside...
-hi anybody anybody at home
-estudiando css css css css.... que desastritooooo
-
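Backreferences work the same way in R, with the backslash doubled inside the string literal. The sketch below assumes the pattern begins with a space (one or more spaces before the captured word); `rep_word_lines` is a hypothetical sample.

```r
# Pattern with an explicit leading space: spaces, a word, spaces, the same word, spaces
repeat_pattern <- " +([a-zA-Z]+) +\\1 +"

# Hypothetical sample lines adapted from the slides above
rep_word_lines <- c(
  "time for bed, night night twitter!",
  "my tattoo is so so itchy today",
  "nothing repeated here"
)

has_repeat <- grepl(repeat_pattern, rep_word_lines)

# In the replacement string, \\1 inserts the remembered word once
deduped <- gsub(repeat_pattern, " \\1 ", rep_word_lines[2])

print(has_repeat)
print(deduped)
```

The same `\\1` syntax is used both inside the pattern and in the replacement argument of `sub` and `gsub`.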
- -
- -
- - -
-

More metacharacters: ( and ) revisited

-
-
-

The * is “greedy” so it always matches the longest possible string that satisfies the regular expression. So

- -
^s(.*)s
-
- -

matches

- -
sitting at starbucks
-setting up mysql and rails
-studying stuff for the exams
-spaghetti with marshmallows
-stop fighting with crackers
-sore shoulders, stupid ergonomics
-
- -
- -
- - -
-

More metacharacters: ( and ) revisited

-
-
-

The greediness of * can be turned off with the ?, as in

- -
^s(.*?)s$
-
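To see the difference between greedy and non-greedy matching, it helps to extract the matched text rather than only test for a match. The sketch below uses `regexpr` with `regmatches` from base R and drops the trailing `$`, since with both anchors in place the two forms would have to cover the whole line anyway.

```r
greedy_input <- "sore shoulders, stupid ergonomics"

# Greedy: .* runs to the last "s" in the string
greedy_text <- regmatches(greedy_input, regexpr("^s(.*)s", greedy_input))

# Non-greedy: .*? stops at the first "s" that completes the match
lazy_text <- regmatches(greedy_input, regexpr("^s(.*?)s", greedy_input))

print(greedy_text)
print(lazy_text)
```

Both patterns match the same lines; only the extent of the matched text differs.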
- -
- -
- - -
-

Summary

-
-
-
    -
  • Regular expressions are used in many different languages; not unique to R.
  • -
  • Regular expressions are composed of literals and metacharacters that represent sets or classes of characters/words
  • -
  • Text processing via regular expressions is a very powerful way to extract data from “unfriendly” sources (not all data comes as a CSV file)
-(Thanks to Mark Hansen for some material in this lecture.)
  • -
- -
- -
- - -
- - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/02_RProgramming/NOTUSED/regex/index.md b/02_RProgramming/NOTUSED/regex/index.md deleted file mode 100644 index 8da8f474..00000000 --- a/02_RProgramming/NOTUSED/regex/index.md +++ /dev/null @@ -1,475 +0,0 @@ ---- -title : Regular Expressions -subtitle : Computing for Data Analysis -author : Roger Peng, Associate Professor -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../libraries - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - -## Regular expressions - -- Regular expressions can be thought of as a combination of literals and _metacharacters_ -- To draw an analogy with natural language, think of literal text forming the words of this language, and the metacharacters defining its grammar -- Regular expressions have a rich set of metacharacters - ---- - -## Literals - -Simplest pattern consists only of literals. The literal “nuclear” would match to the following lines: - -```markdown -Ooh. I just learned that to keep myself alive after a -nuclear blast! All I have to do is milk some rats -then drink the milk. Aweosme. :} - -Laozi says nuclear weapons are mas macho - -Chaos in a country that has nuclear weapons -- not good. - -my nephew is trying to teach me nuclear physics, or -possibly just trying to show me how smart he is -so I’ll be proud of him [which I am]. - -lol if you ever say "nuclear" people immediately think -DEATH by radiation LOL -``` - ---- - -## Literals - -The literal “Obama” would match to the following lines - -```markdown -Politics r dum. Not 2 long ago Clinton was sayin Obama -was crap n now she sez vote 4 him n unite? WTF? -Screw em both + Mcain. Go Ron Paul! 
- -Clinton conceeds to Obama but will her followers listen?? - -Are we sure Chelsea didn’t vote for Obama? - -thinking ... Michelle Obama is terrific! - -jetlag..no sleep...early mornig to starbux..Ms. Obama -was moving -``` - ---- - -## Regular Expressions - -- Simplest pattern consists only of literals; a match occurs if the sequence of literals occurs anywhere in the text being tested - -- What if we only want the word “Obama”? or sentences that end in the word “Clinton”, or “clinton” or “clinto”? - ---- - -## Regular Expressions - -We need a way to express -- whitespace word boundaries -- sets of literals -- the beginning and end of a line -- alternatives (“war” or “peace”) -Metacharacters to the rescue! - ---- - -## Metacharacters - -Some metacharacters represent the start of a line - -```markdown -^i think -``` - -will match the lines - -```markdown -i think we all rule for participating -i think i have been outed -i think this will be quite fun actually -i think i need to go to work -i think i first saw zombo in 1999. -``` - ---- - -## Metacharacters - -$ represents the end of a line - -```markdown -morning$ -``` - -will match the lines - -```markdown -well they had something this morning -then had to catch a tram home in the morning -dog obedience school in the morning -and yes happy birthday i forgot to say it earlier this morning -I walked in the rain this morning -good morning -``` - ---- - -## Character Classes with [] - -We can list a set of characters we will accept at a given point in the match - -```markdown -[Bb][Uu][Ss][Hh] -``` - -will match the lines - -```markdown -The democrats are playing, "Name the worst thing about Bush!" 
-I smelled the desert creosote bush, brownies, BBQ chicken -BBQ and bushwalking at Molonglo Gorge -Bush TOLD you that North Korea is part of the Axis of Evil -I’m listening to Bush - Hurricane (Album Version) -``` - ---- - -## Character Classes with [] - -```markdown -^[Ii] am -``` - -will match - -```markdown -i am so angry at my boyfriend i can’t even bear to -look at him - -i am boycotting the apple store - -I am twittering from iPhone - -I am a very vengeful person when you ruin my sweetheart. - -I am so over this. I need food. Mmmm bacon... -``` - ---- - -## Character Classes with [] - -Similarly, you can specify a range of letters [a-z] or [a-zA-Z]; notice that the order doesn’t matter - -```markdown -^[0-9][a-zA-Z] -``` - -will match the lines - -```markdown -7th inning stretch -2nd half soon to begin. OSU did just win something -3am - cant sleep - too hot still.. :( -5ft 7 sent from heaven -1st sign of starvagtion -``` - ---- - -## Character Classes with [] - -When used at the beginning of a character class, the “^” is also a metacharacter and indicates matching characters NOT in the indicated class - -```markdown -[^?.]$ -``` - -will match the lines - -```markdown -i like basketballs -6 and 9 -dont worry... we all die anyway! -Not in Baghdad -helicopter under water? hmmm -``` - ---- - -## More Metacharacters - -“.” is used to refer to any character. So - -```markdown -9.11 -``` - -will match the lines - -```markdown -its stupid the post 9-11 rules -if any 1 of us did 9/11 we would have been caught in days. -NetBios: scanning ip 203.169.114.66 -Front Door 9:11:46 AM -Sings: 0118999881999119725...3 ! -``` - ---- - -## More Metacharacters: | - -This does not mean “pipe” in the context of regular expressions; instead it translates to “or”; we can use it to combine two expressions, the subexpressions being called alternatives - -```markdown -flood|fire -``` - -will match the lines - -```markdown -is firewire like usb on none macs? 
-the global flood makes sense within the context of the bible -yeah ive had the fire on tonight -... and the floods, hurricanes, killer heatwaves, rednecks, gun nuts, etc. - -``` - ---- - -## More Metacharacters: | - -We can include any number of alternatives... - -```markdown -flood|earthquake|hurricane|coldfire -``` - -will match the lines - -```markdown -Not a whole lot of hurricanes in the Arctic. -We do have earthquakes nearly every day somewhere in our State -hurricanes swirl in the other direction -coldfire is STRAIGHT! -’cause we keep getting earthquakes -``` - ---- - -## More Metacharacters: | - -The alternatives can be real expressions and not just literals - -```markdown -^[Gg]ood|[Bb]ad -``` - -will match the lines - -```markdown -good to hear some good knews from someone here -Good afternoon fellow american infidels! -good on you-what do you drive? -Katie... guess they had bad experiences... -my middle name is trouble, Miss Bad News -``` - ---- - -## More Metacharacters: ( and ) - -Subexpressions are often contained in parentheses to constrain the alternatives - -```markdown -^([Gg]ood|[Bb]ad) -``` - -will match the lines - -```markdown -bad habbit -bad coordination today -good, becuase there is nothing worse than a man in kinky underwear -Badcop, its because people want to use drugs -Good Monday Holiday -Good riddance to Limey -``` - ---- - -## More Metacharacters: ? - -The question mark indicates that the indicated expression is optional - -```markdown -[Gg]eorge( [Ww]\.)? [Bb]ush -``` - -will match the lines - -```markdown -i bet i can spell better than you and george bush combined -BBC reported that President George W. Bush claimed God told him to invade I -a bird in the hand is worth two george bushes -``` - ---- - -## One thing to note... - -In the following - -```markdown -[Gg]eorge( [Ww]\.)? 
[Bb]ush -``` - -we wanted to match a “.” as a literal period; to do that, we had to “escape” the metacharacter, preceding it with a backslash In general, we have to do this for any metacharacter we want to include in our match - ---- - -## More metacharacters: * and + - -The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item” - -```markdown -(.*) -``` - -will match the lines - -```markdown -anyone wanna chat? (24, m, germany) -hello, 20.m here... ( east area + drives + webcam ) -(he means older men) -() -``` - ---- - -## More metacharacters: * and + - -The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item” - -```markdown -[0-9]+ (.*)[0-9]+ -``` - -will match the lines - -```markdown -working as MP here 720 MP battallion, 42nd birgade -so say 2 or 3 years at colleage and 4 at uni makes us 23 when and if we fin -it went down on several occasions for like, 3 or 4 *days* -Mmmm its time 4 me 2 go 2 bed -``` - ---- - -## More metacharacters: { and } - -{ and } are referred to as interval quantifiers; the let us specify the minimum and maximum number of matches of an expression - -```markdown -[Bb]ush( +[^ ]+ +){1,5} debate -``` - -will match the lines - -```markdown -Bush has historically won all major debates he’s done. -in my view, Bush doesn’t need these debates.. -bush doesn’t need the debates? maybe you are right -That’s what Bush supporters are doing about the debate. -Felix, I don’t disagree that Bush was poorly prepared for the debate. -indeed, but still, Bush should have taken the debate more seriously. 
-Keep repeating that Bush smirked and scowled during the debate -``` - ---- - -## More metacharacters: and - -- m,n means at least m but not more than n matches -- m means exactly m matches -- m, means at least m matches - ---- - -## More metacharacters: ( and ) revisited - -- In most implementations of regular expressions, the parentheses not only limit the scope of alternatives divided by a “|”, but also can be used to “remember” text matched by the subexpression enclosed -- We refer to the matched text with \1, \2, etc. - ---- - -## More metacharacters: ( and ) revisited - -So the expression - -```markdown -+([a-zA-Z]+) +\1 + -``` - -will match the lines - -```markdown -time for bed, night night twitter! -blah blah blah blah -my tattoo is so so itchy today -i was standing all all alone against the world outside... -hi anybody anybody at home -estudiando css css css css.... que desastritooooo -``` - ---- - -## More metacharacters: ( and ) revisited - -The * is “greedy” so it always matches the _longest_ possible string that satisfies the regular expression. So - -```markdown -^s(.*)s -``` - -matches - -```markdown -sitting at starbucks -setting up mysql and rails -studying stuff for the exams -spaghetti with marshmallows -stop fighting with crackers -sore shoulders, stupid ergonomics -``` - ---- - -## More metacharacters: ( and ) revisited - -The greediness of * can be turned off with the ?, as in - -```markdown -^s(.*?)s$ -``` - ---- - -## Summary - -- Regular expressions are used in many different languages; not unique to R. -- Regular expressions are composed of literals and metacharacters that represent sets or classes of characters/words -- Text processing via regular expressions is a very powerful way to extract data from “unfriendly” sources (not all data comes as a CSV file) -(Thanks to Mark Hansen for some material in this lecture.) 
diff --git a/02_RProgramming/Subsetting/index.Rmd b/02_RProgramming/Subsetting/index.Rmd index 64179968..88816007 100644 --- a/02_RProgramming/Subsetting/index.Rmd +++ b/02_RProgramming/Subsetting/index.Rmd @@ -21,7 +21,7 @@ There are a number of operators that can be used to extract subsets of R objects - `[[` is used to extract elements of a list or a data frame; it can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame -- `$` is used to extract elements of a list or data frame by name; semantics are similar to hat of `[[`. +- `$` is used to extract elements of a list or data frame by name; semantics are similar to that of `[[`. --- @@ -237,4 +237,4 @@ What if there are multiple things and you want to take the subset with no missin 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 7 23 299 8.6 65 5 7 -``` \ No newline at end of file +``` diff --git a/02_RProgramming/Subsetting/index.html b/02_RProgramming/Subsetting/index.html index 52258492..e907844a 100644 --- a/02_RProgramming/Subsetting/index.html +++ b/02_RProgramming/Subsetting/index.html @@ -19,6 +19,11 @@ + + + + + @@ -53,7 +58,7 @@

Subsetting

diff --git a/02_RProgramming/Subsetting/index.md b/02_RProgramming/Subsetting/index.md index f236e5ae..88816007 100644 --- a/02_RProgramming/Subsetting/index.md +++ b/02_RProgramming/Subsetting/index.md @@ -21,7 +21,7 @@ There are a number of operators that can be used to extract subsets of R objects - `[[` is used to extract elements of a list or a data frame; it can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame -- `$` is used to extract elements of a list or data frame by name; semantics are similar to hat of `[[`. +- `$` is used to extract elements of a list or data frame by name; semantics are similar to that of `[[`. --- diff --git a/02_RProgramming/help/GettingHelp.pdf b/02_RProgramming/help/GettingHelp.pdf deleted file mode 100644 index 0b84067a..00000000 Binary files a/02_RProgramming/help/GettingHelp.pdf and /dev/null differ diff --git a/02_RProgramming/help/slides/help_slide01.png b/02_RProgramming/help/slides/help_slide01.png index 3288e691..8e909a13 100644 Binary files a/02_RProgramming/help/slides/help_slide01.png and b/02_RProgramming/help/slides/help_slide01.png differ diff --git a/02_RProgramming/help/slides/help_slide02.png b/02_RProgramming/help/slides/help_slide02.png index deb60190..59b313df 100644 Binary files a/02_RProgramming/help/slides/help_slide02.png and b/02_RProgramming/help/slides/help_slide02.png differ diff --git a/02_RProgramming/help/slides/help_slide03.png b/02_RProgramming/help/slides/help_slide03.png index 3a011302..2a25751e 100644 Binary files a/02_RProgramming/help/slides/help_slide03.png and b/02_RProgramming/help/slides/help_slide03.png differ diff --git a/02_RProgramming/help/slides/help_slide04.png b/02_RProgramming/help/slides/help_slide04.png index 08859e73..11cd28fd 100644 Binary files a/02_RProgramming/help/slides/help_slide04.png and b/02_RProgramming/help/slides/help_slide04.png differ diff --git 
a/02_RProgramming/help/slides/help_slide05.png b/02_RProgramming/help/slides/help_slide05.png index f345144c..da1cc664 100644 Binary files a/02_RProgramming/help/slides/help_slide05.png and b/02_RProgramming/help/slides/help_slide05.png differ diff --git a/02_RProgramming/help/slides/help_slide06.png b/02_RProgramming/help/slides/help_slide06.png index a6aa30a6..e738a843 100644 Binary files a/02_RProgramming/help/slides/help_slide06.png and b/02_RProgramming/help/slides/help_slide06.png differ diff --git a/02_RProgramming/help/slides/help_slide07.png b/02_RProgramming/help/slides/help_slide07.png index 3a7d8bd5..0774b91b 100644 Binary files a/02_RProgramming/help/slides/help_slide07.png and b/02_RProgramming/help/slides/help_slide07.png differ diff --git a/02_RProgramming/help/slides/help_slide08.png b/02_RProgramming/help/slides/help_slide08.png index 38a6a0b9..b609bc07 100644 Binary files a/02_RProgramming/help/slides/help_slide08.png and b/02_RProgramming/help/slides/help_slide08.png differ diff --git a/02_RProgramming/help/slides/help_slide09.png b/02_RProgramming/help/slides/help_slide09.png index 03de7e78..051e8c40 100644 Binary files a/02_RProgramming/help/slides/help_slide09.png and b/02_RProgramming/help/slides/help_slide09.png differ diff --git a/02_RProgramming/help/slides/help_slide10.png b/02_RProgramming/help/slides/help_slide10.png index bfbfbc7b..403aff32 100644 Binary files a/02_RProgramming/help/slides/help_slide10.png and b/02_RProgramming/help/slides/help_slide10.png differ diff --git a/02_RProgramming/help/slides/help_slide11.png b/02_RProgramming/help/slides/help_slide11.png index e79f52c2..4279c36d 100644 Binary files a/02_RProgramming/help/slides/help_slide11.png and b/02_RProgramming/help/slides/help_slide11.png differ diff --git a/02_RProgramming/help/slides/help_slide12.png b/02_RProgramming/help/slides/help_slide12.png index e9f98bf0..2515a669 100644 Binary files a/02_RProgramming/help/slides/help_slide12.png and 
b/02_RProgramming/help/slides/help_slide12.png differ diff --git a/02_RProgramming/help/slides/help_slide13.png b/02_RProgramming/help/slides/help_slide13.png index cab85784..e37057d1 100644 Binary files a/02_RProgramming/help/slides/help_slide13.png and b/02_RProgramming/help/slides/help_slide13.png differ diff --git a/02_RProgramming/help/slides/help_slide14.png b/02_RProgramming/help/slides/help_slide14.png index 2bea08c7..cee06269 100644 Binary files a/02_RProgramming/help/slides/help_slide14.png and b/02_RProgramming/help/slides/help_slide14.png differ diff --git a/02_RProgramming/lectures/Subsetting.pdf b/02_RProgramming/lectures/Subsetting.pdf index 92b158e2..0e576a6d 100644 Binary files a/02_RProgramming/lectures/Subsetting.pdf and b/02_RProgramming/lectures/Subsetting.pdf differ