Answers
You can use the strpos() function, which finds the position of the first occurrence of one string inside another:
$a = 'How are you?';
if (strpos($a, 'are') !== false) {
    echo 'true';
}
Note that the use of !== false is deliberate; strpos() returns either the offset at which the needle string begins in the haystack string, or the boolean false if the needle isn't found. Since 0 is a valid offset and 0 is "falsey", we can't use simpler constructs like !strpos($a, 'are').
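To make the offset-0 pitfall concrete, here is a minimal sketch in which the needle sits at the very start of the haystack. The helper name containsSubstring() is hypothetical (it is not part of the answer); it assumes you may be running on PHP 8.0 or later, where str_contains() is available, and falls back to strpos() otherwise.

```php
<?php
// Needle at the very start of the haystack: strpos() returns int(0).
$a = 'are you there?';

var_dump(strpos($a, 'are'));            // int(0)
var_dump(!strpos($a, 'are'));           // bool(true): wrongly reads as "not found"
var_dump(strpos($a, 'are') !== false);  // bool(true): correctly reads as "found"

// Hypothetical helper (an assumption, not from the answer):
// prefer str_contains() where it exists, otherwise use the strict strpos() check.
function containsSubstring(string $haystack, string $needle): bool
{
    if (function_exists('str_contains')) {
        return str_contains($haystack, $needle); // PHP 8.0+
    }
    // Treat an empty needle as "found", matching str_contains().
    return $needle === '' || strpos($haystack, $needle) !== false;
}

var_dump(containsSubstring('How are you?', 'are')); // bool(true)
```

Wrapping the check in one function keeps the strict comparison in a single place, so callers cannot accidentally write the loose form.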
Use the strpos function:
if (strpos($a, 'are') !== false)
    echo 'true';
While most of these answers will tell you if a substring appears in your string, that's usually not what you want if you're looking for a particular word, and not a substring.
What's the difference? Substrings can appear within other words:
- The "are" at the beginning of "area"
- The "are" at the end of "hare"
- The "are" in the middle of "fares"
One way to mitigate this would be to use a regular expression coupled with word boundaries (\b):
function containsWord($str, $word)
{
    return !!preg_match('#\\b' . preg_quote($word, '#') . '\\b#i', $str);
}
This method doesn't have the same false positives noted above, but it does have some edge cases of its own. A word boundary (\b) matches between a word character and a non-word character (\W), where a non-word character is anything that isn't a-z, A-Z, 0-9, or _. That means digits and underscores count as word characters, so scenarios like these will fail:
- The "are" in "What _are_ you thinking?"
- The "are" in "lol u dunno wut those are4?"
If you want anything more accurate than this, you'll have to start doing English language syntax parsing, and that's a pretty big can of worms (and assumes proper use of syntax, anyway, which isn't always a given).
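The two edge cases above can be reproduced, and partly worked around, with lookarounds that treat only ASCII letters as word characters. This is a sketch under that assumption: containsWordLettersOnly() is a hypothetical name, and it will not handle accented or non-Latin letters.

```php
<?php
// From the answer above: \b treats digits and underscores as word characters.
function containsWord($str, $word)
{
    return !!preg_match('#\\b' . preg_quote($word, '#') . '\\b#i', $str);
}

// Hypothetical variant (an assumption, not from the answer):
// only ASCII letters count as part of a word, so "_" and "4" act as separators.
function containsWordLettersOnly(string $str, string $word): bool
{
    $pattern = '#(?<![A-Za-z])' . preg_quote($word, '#') . '(?![A-Za-z])#i';
    return preg_match($pattern, $str) === 1;
}

var_dump(containsWord('What _are_ you thinking?', 'are'));               // bool(false)
var_dump(containsWordLettersOnly('What _are_ you thinking?', 'are'));    // bool(true)
var_dump(containsWordLettersOnly('lol u dunno wut those are4?', 'are')); // bool(true)
var_dump(containsWordLettersOnly('a hare in the area', 'are'));          // bool(false)
```

Whether "are4" should count as containing the word "are" depends on your data, so treat the choice of separator characters as a design decision rather than a fix.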
If you want to avoid the "falsey" and "truthy" problem, you can use substr_count:
if (substr_count($a, 'are') > 0) {
    echo "at least one 'are' is present!";
}
It's a bit slower than strpos because it counts every occurrence instead of stopping at the first one, but it avoids the comparison problems.
Following up on the comments by SamGoody and Lego Stormtroopr.
If you are looking for a PHP algorithm that ranks search results by the proximity or relevance of multiple words, here is a quick and easy way of generating ranked search results with PHP only:
Issues with the other boolean search methods, such as strpos(), preg_match(), strstr() or stristr():
- can't search for multiple words
- results are unranked
PHP method based on Vector Space Model and tf-idf (term frequency–inverse document frequency):
It sounds difficult but is surprisingly easy.
If we want to search for multiple words in a string, the core problem is how to assign a weight to each of them.
If we could weight the terms in a string based on how representative they are of the string as a whole, we could order our results by the ones that best match the query.
This is the idea of the vector space model, not far from how SQL full-text search works:
function get_corpus_index($corpus = array(), $separator = ' ') {
    $dictionary = array();
    $doc_count = array();
    foreach ($corpus as $doc_id => $doc) {
        $terms = explode($separator, $doc);
        // number of terms in this document, used later for length normalisation
        $doc_count[$doc_id] = count($terms);
        // tf–idf, short for term frequency–inverse document frequency,
        // according to wikipedia is a numerical statistic that is intended to reflect
        // how important a word is to a document in a corpus
        foreach ($terms as $term) {
            if (!isset($dictionary[$term])) {
                $dictionary[$term] = array('document_frequency' => 0, 'postings' => array());
            }
            if (!isset($dictionary[$term]['postings'][$doc_id])) {
                $dictionary[$term]['document_frequency']++;
                $dictionary[$term]['postings'][$doc_id] = array('term_frequency' => 0);
            }
            $dictionary[$term]['postings'][$doc_id]['term_frequency']++;
        }
        //from http://phpir.com/simple-search-the-vector-space-model/
    }
    return array('doc_count' => $doc_count, 'dictionary' => $dictionary);
}
function get_similar_documents($query = '', $corpus = array(), $separator = ' ') {
    $similar_documents = array();
    if ($query != '' && !empty($corpus)) {
        $words = explode($separator, $query);
        $corpus = get_corpus_index($corpus, $separator);
        $doc_count = count($corpus['doc_count']);
        foreach ($words as $word) {
            if (isset($corpus['dictionary'][$word])) {
                $entry = $corpus['dictionary'][$word];
                foreach ($entry['postings'] as $doc_id => $posting) {
                    // get term frequency–inverse document frequency
                    // note: operator precedence makes this log2(N + 1/df + 1),
                    // not log2((N + 1) / (df + 1)); the results below rely on it
                    $score = $posting['term_frequency'] * log($doc_count + (1 / $entry['document_frequency']) + 1, 2);
                    if (isset($similar_documents[$doc_id])) {
                        $similar_documents[$doc_id] += $score;
                    } else {
                        $similar_documents[$doc_id] = $score;
                    }
                }
            }
        }
        // length normalise
        foreach ($similar_documents as $doc_id => $score) {
            $similar_documents[$doc_id] = $score / $corpus['doc_count'][$doc_id];
        }
        // sort from high to low
        arsort($similar_documents);
    }
    return $similar_documents;
}
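Written out as a formula, the score that get_similar_documents() computes for a query q and a document d appears to be the following. The symbols are introduced here for illustration only: N is the number of documents in the corpus, df_t is the number of documents containing term t, tf_{t,d} is the number of times t occurs in d, |d| is the number of terms in d, and the sum runs over the query terms that occur in d.

```latex
\[
\operatorname{score}(q, d)
  \;=\; \frac{1}{|d|} \sum_{t \,\in\, q,\; t \,\in\, d}
        \mathrm{tf}_{t,d} \cdot \log_2\!\left( N + \frac{1}{\mathrm{df}_t} + 1 \right)
\]
```

As a check against the first case below: with N = 1, df = 1, tf = 1 and |d| = 3, the score is log2(3) / 3 ≈ 0.5283, which matches the printed result. Note that the more common smoothed weighting, log2((N + 1) / (df + 1)), would give 0 for that single-document corpus, so the two are not interchangeable.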
CASE 1
$query = 'are';
$corpus = array(
1 => 'How are you?',
);
$match_results=get_similar_documents($query,$corpus);
echo '<pre>';
print_r($match_results);
echo '</pre>';
RESULT
Array
(
[1] => 0.52832083357372
)
CASE 2
$query = 'are';
$corpus = array(
1 => 'how are you today?',
2 => 'how do you do',
3 => 'here you are! how are you? Are we done yet?'
);
$match_results=get_similar_documents($query,$corpus);
echo '<pre>';
print_r($match_results);
echo '</pre>';
RESULTS
Array
(
[1] => 0.54248125036058
[3] => 0.21699250014423
)
CASE 3
$query = 'we are done';
$corpus = array(
1 => 'how are you today?',
2 => 'how do you do',
3 => 'here you are! how are you? Are we done yet?'
);
$match_results=get_similar_documents($query,$corpus);
echo '<pre>';
print_r($match_results);
echo '</pre>';
RESULTS
Array
(
[3] => 0.6813781191217
[1] => 0.54248125036058
)
There are plenty of improvements to be made, but the model provides a way of getting good, ranked results from natural-language queries, which boolean checks such as strpos(), preg_match(), strstr() or stristr() cannot do.
NOTA BENE
Optionally, eliminate redundancy before searching the words. This
- reduces index size and storage requirements
- reduces disk I/O
- speeds up indexing and, consequently, searching.
1. Normalisation
- Convert all text to lower case
2. Stopword elimination
- Eliminate words from the text which carry no real meaning (like 'and', 'or', 'the', 'for', etc.)
3. Dictionary substitution
- Replace words with others which have an identical or similar meaning (e.g. replace instances of 'hungrily' and 'hungry' with 'hunger')
- Further algorithmic measures (snowball) may be performed to reduce words to their essential meaning.
- Replacing colour names with their hexadecimal equivalents and reducing the precision of numeric values are other ways of normalising the text.
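The first two steps above (normalisation and stopword elimination) can be sketched as a small preprocessing function. Everything here is an assumption for illustration: the name normalise_text(), the tiny stopword list, and the extra punctuation-stripping step, which handles ASCII text only.

```php
<?php
// Hypothetical helper (not from the answer): lower-case, strip punctuation, drop stopwords.
// The default stopword list is a tiny illustrative sample, not a complete one.
function normalise_text(string $text, array $stopwords = ['and', 'or', 'the', 'for']): string
{
    $text = strtolower($text);                          // 1. normalisation
    $text = preg_replace('/[^a-z0-9\s]+/', '', $text);  // strip punctuation (ASCII-only assumption)
    $terms = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $terms = array_diff($terms, $stopwords);            // 2. stopword elimination
    return implode(' ', $terms);
}

echo normalise_text('The cat and the dog: ready for dinner?');
// cat dog ready dinner
```

Applying the same function to both the query and every document before calling get_similar_documents() keeps terms such as "Are", "are!" and "are" from being indexed as three different words.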
RESOURCES
- http://linuxgazette.net/164/sephton.html
- http://snowball.tartarus.org/
- MySQL Fulltext Search Score Explained
- http://dev.mysql.com/doc/internals/en/full-text-search.html
- http://en.wikipedia.org/wiki/Vector_space_model
- http://en.wikipedia.org/wiki/Tf%E2%80%93idf
- http://phpir.com/simple-search-the-vector-space-model/
I'm a bit surprised that none of the answers here that used strpos, strstr and similar functions mentioned Multibyte String Functions yet (2015-05-08).
Basically, if you're having trouble finding words with characters specific to some languages, such as German, French, Portuguese, Spanish, etc. (e.g.: ä, é, ô, ç, º, ñ), you may want to precede the functions with mb_. Therefore, the accepted answer would use mb_strpos or mb_stripos (for case-insensitive matching) instead:
if (mb_strpos($a, 'are') !== false) {
    echo 'true';
}
If you cannot guarantee that all your data is 100% in UTF-8, you may want to use the mb_ functions.
A good article to understand why is "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.
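A short sketch of where the byte-oriented and multibyte functions actually diverge. It assumes the mbstring extension is loaded and that the source file is saved as UTF-8; the sample string is made up for illustration.

```php
<?php
// Assumes the mbstring extension is loaded and this file is saved as UTF-8.
$haystack = 'ÄRGER im Büro';

// Case-insensitive search: stripos() does not fold "Ä" to "ä" in UTF-8 text.
var_dump(stripos($haystack, 'ärger'));                // bool(false)
var_dump(mb_stripos($haystack, 'ärger', 0, 'UTF-8')); // int(0)

// Offsets differ too: strpos() counts bytes, mb_strpos() counts characters.
var_dump(strpos($haystack, 'Büro'));                  // int(10)
var_dump(mb_strpos($haystack, 'Büro', 0, 'UTF-8'));   // int(9)
```

For a plain case-sensitive "does it contain" check, both families agree on found versus not found; the differences show up in case-insensitive matching and in the offsets returned.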
if (preg_match('/are/', $a)) {
    echo 'true';
}
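You can use the strstr function:
$haystack = "I know programming";
$needle = "know";
$flag = strstr($haystack, $needle);
// strstr() returns false when the needle is absent; compare strictly,
// because a matched tail such as "0" is itself falsy
if ($flag !== false) {
    echo "true";
}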
Without using an inbuilt function:
$haystack = "hello world";
$needle = "llo";
$i = $j = 0;
while (isset($needle[$i]) && isset($haystack[$j])) {
    if ($needle[$i] === $haystack[$j]) {
        // characters match: advance in both strings
        $i++;
        $j++;
    } else {
        // mismatch: restart one character after where this attempt began
        $j = $j - $i + 1;
        $i = 0;
    }
}
// the whole needle was consumed only if every character matched
if (!isset($needle[$i])) {
    echo "YES";
}
else {
    echo "NO ";
}
The short-hand version
$result = false!==strpos($a, 'are');
In order to find a 'word', rather than the occurrence of a series of letters that could in fact be a part of another word, the following would be a good solution.
$string = 'How are you?';
$array = explode(" ", $string);
if (in_array('are', $array) ) {
echo 'Found the word';
}
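One caveat with splitting on spaces is that punctuation stays attached to the neighbouring word, so 'you' is not found in 'How are you?'. A minimal sketch of one way around that, assuming it is acceptable to treat every run of non-word characters as a separator:

```php
<?php
$string = 'How are you?';

// explode() leaves punctuation attached, so the last token is "you?".
var_dump(in_array('you', explode(' ', $string), true)); // bool(false)

// Sketch: split on runs of non-word characters instead of single spaces.
$words = preg_split('/\W+/', $string, -1, PREG_SPLIT_NO_EMPTY);
var_dump(in_array('you', $words, true));                // bool(true)
var_dump(in_array('ar', $words, true));                 // bool(false): still whole words only
```

The comparison remains case-sensitive, so 'how' would not match 'How' unless both sides are lower-cased first.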
It can be done in three different ways:
$a = 'How are you?';
1- stristr()
if (strlen(stristr($a, "are")) > 0) {
    echo "true"; // are Found
}
2- strpos()
if (strpos($a, "are") !== false) {
    echo "true"; // are Found
}
3- preg_match()
if (preg_match("/are/", $a) === 1) {
    echo "true"; // are Found
}
$a = 'how are you';
if (strpos($a, 'are') !== false) {
    echo 'true';
}
You need to use the identical/not identical operators because strpos can return 0 as its index value. If you like ternary operators, consider using the following (seems a little backwards, I'll admit):
echo FALSE === strpos($a,'are') ? 'false': 'true';
The strpos function works fine, but if you want to do case-insensitive checking for a word in a paragraph, then you can make use of PHP's stripos function.
For example,
$result = stripos("I love PHP, I love PHP too!", "php");
if ($result === false) {
// Word does not exist
}
else {
// Word exists
}
stripos finds the position of the first occurrence of a case-insensitive substring in a string.
If the word doesn't exist in the string, it returns false; otherwise it returns the position of the word.