Monday, 8 October 2018

How do I check if a string contains a specific word?

Answers


You can use the strpos() function, which finds the position of the first occurrence of one string inside another:
$a = 'How are you?';

if (strpos($a, 'are') !== false) {
    echo 'true';
}
Note that the use of !== false is deliberate; strpos() returns either the offset at which the needle string begins in the haystack string, or the boolean false if the needle isn't found. Since 0 is a valid offset and 0 is "falsey", we can't use simpler constructs like !strpos($a, 'are').
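To see why the strict comparison matters, here is a minimal sketch in which the needle sits at offset 0. The helper name containsSubstring is hypothetical and only introduced for illustration; it is not part of the original answer.

```php
<?php
// Hypothetical helper: wraps the strict check described above.
function containsSubstring(string $haystack, string $needle): bool
{
    // strpos() returns an int offset (possibly 0) or false when nothing is found.
    return strpos($haystack, $needle) !== false;
}

$a = 'are you there?';

// 'are' starts at offset 0, and (bool) 0 is false, so a loose check misses the match.
var_dump((bool) strpos($a, 'are'));     // bool(false)
// The strict comparison reports the match correctly.
var_dump(containsSubstring($a, 'are')); // bool(true)
```

On PHP 8.0 and later, str_contains($a, 'are') returns a plain boolean and sidesteps the issue entirely.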



Use strpos function:
if (strpos($a, 'are') !== false)
    echo 'true';



While most of these answers will tell you if a substring appears in your string, that's usually not what you want if you're looking for a particular word, and not a substring.
What's the difference? Substrings can appear within other words:
  • The "are" at the beginning of "area"
  • The "are" at the end of "hare"
  • The "are" in the middle of "fares"
One way to mitigate this would be to use a regular expression coupled with word boundaries (\b):
function containsWord($str, $word)
{
    return !!preg_match('#\\b' . preg_quote($word, '#') . '\\b#i', $str);
}
This method doesn't have the same false positives noted above, but it does have some edge cases of its own. Word boundaries match on non-word characters (\W), which are going to be anything that isn't a-zA-Z0-9, or _. That means digits and underscores are going to be counted as word characters and scenarios like this will fail:
  • The "are" in "What _are_ you thinking?"
  • The "are" in "lol u dunno wut those are4?"
If you want anything more accurate than this, you'll have to start doing English language syntax parsing, and that's a pretty big can of worms (and assumes proper use of syntax, anyway, which isn't always a given).
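If underscores and digits should count as separators, one option short of full language parsing is to swap \b for lookarounds that treat only letters as word characters. This is a sketch rather than part of the original answer: containsWordLoose is a hypothetical name, and restricting "word characters" to ASCII a-z is an assumption that will not suit every text.

```php
<?php
// Hypothetical variant of containsWord(): only ASCII letters count as word characters.
function containsWordLoose(string $str, string $word): bool
{
    // (?<![a-z]) : no letter directly before the word
    // (?![a-z])  : no letter directly after the word
    $pattern = '#(?<![a-z])' . preg_quote($word, '#') . '(?![a-z])#i';
    return preg_match($pattern, $str) === 1;
}

var_dump(containsWordLoose('What _are_ you thinking?', 'are'));    // bool(true)
var_dump(containsWordLoose('lol u dunno wut those are4?', 'are')); // bool(true)
var_dump(containsWordLoose('The area is large', 'are'));           // bool(false)
```

The trade-off is that strings such as "are4" now count as containing the word, which may or may not be what you want.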



Another option is strstr(), or stristr() if your search should be case-insensitive.
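A short sketch of the difference between the two, using a made-up sample string. Both functions return the portion of the haystack starting at the match, or false when there is no match, so the same strict comparison against false applies.

```php
<?php
$a = 'How ARE you?';

// strstr() is case-sensitive: 'are' does not match 'ARE'.
var_dump(strstr($a, 'are') !== false);  // bool(false)

// stristr() ignores case and returns the rest of the string from the match onwards.
var_dump(stristr($a, 'are'));           // string(8) "ARE you?"
var_dump(stristr($a, 'are') !== false); // bool(true)
```

Because both functions build and return a new string, strpos() or stripos() is usually the lighter choice when you only need to know whether the needle occurs.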



If you want to avoid the "falsey" and "truthy" problem, you can use substr_count:
if (substr_count($a, 'are') > 0) {
    echo "at least one 'are' is present!";
}
It's a bit slower than strpos but it avoids the comparison problems.



Peer to SamGoody and Lego Stormtroopr comments.
If you are looking for a PHP algorithm to rank search results based on the proximity/relevance of multiple words, here is a quick and easy way of generating search results with PHP only.
Issues with the other boolean search methods such as strpos(), preg_match(), strstr() or stristr():
  1. can't search for multiple words
  2. results are unranked
It sounds difficult but is surprisingly easy.
If we want to search for multiple words in a string, the core problem is how to assign a weight to each one of them.
If we could weight the terms in a string based on how representative they are of the string as a whole, we could order our results by the ones that best match the query.
This is the idea of the vector space model, not far from how SQL full-text search works:
function get_corpus_index($corpus = array(), $separator=' ') {

    $dictionary = array();

    $doc_count = array();

    foreach($corpus as $doc_id => $doc) {

        $terms = explode($separator, $doc);

        $doc_count[$doc_id] = count($terms);

        // tf–idf, short for term frequency–inverse document frequency, 
        // according to wikipedia is a numerical statistic that is intended to reflect 
        // how important a word is to a document in a corpus

        foreach($terms as $term) {

            if(!isset($dictionary[$term])) {

                $dictionary[$term] = array('document_frequency' => 0, 'postings' => array());
            }
            if(!isset($dictionary[$term]['postings'][$doc_id])) {

                $dictionary[$term]['document_frequency']++;

                $dictionary[$term]['postings'][$doc_id] = array('term_frequency' => 0);
            }

            $dictionary[$term]['postings'][$doc_id]['term_frequency']++;
        }

        //from http://phpir.com/simple-search-the-vector-space-model/

    }

    return array('doc_count' => $doc_count, 'dictionary' => $dictionary);
}

function get_similar_documents($query='', $corpus=array(), $separator=' '){

    $similar_documents=array();

    if($query!=''&&!empty($corpus)){

        $words=explode($separator,$query);

        $corpus=get_corpus_index($corpus, $separator);

        $doc_count=count($corpus['doc_count']);

        foreach($words as $word) {

            if(isset($corpus['dictionary'][$word])){

                $entry = $corpus['dictionary'][$word];


                foreach($entry['postings'] as $doc_id => $posting) {

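                    //get term frequency–inverse document frequency
                    //note: by operator precedence the log argument evaluates as
                    //$doc_count + (1 / document_frequency) + 1; the results shown below were produced with it
                    $score=$posting['term_frequency'] * log($doc_count + 1 / $entry['document_frequency'] + 1, 2);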

                    if(isset($similar_documents[$doc_id])){

                        $similar_documents[$doc_id]+=$score;

                    }
                    else{

                        $similar_documents[$doc_id]=$score;

                    }
                }
            }
        }

        // length normalise
        foreach($similar_documents as $doc_id => $score) {

            $similar_documents[$doc_id] = $score/$corpus['doc_count'][$doc_id];

        }

        // sort from  high to low

        arsort($similar_documents);

    }   

    return $similar_documents;
}
CASE 1
$query = 'are';

$corpus = array(
    1 => 'How are you?',
);

$match_results=get_similar_documents($query,$corpus);
echo '<pre>';
    print_r($match_results);
echo '</pre>';
RESULT
Array
(
    [1] => 0.52832083357372
)
CASE 2
$query = 'are';

$corpus = array(
    1 => 'how are you today?',
    2 => 'how do you do',
    3 => 'here you are! how are you? Are we done yet?'
);

$match_results=get_similar_documents($query,$corpus);
echo '<pre>';
    print_r($match_results);
echo '</pre>';
RESULTS
Array
(
    [1] => 0.54248125036058
    [3] => 0.21699250014423
)
CASE 3
$query = 'we are done';

$corpus = array(
    1 => 'how are you today?',
    2 => 'how do you do',
    3 => 'here you are! how are you? Are we done yet?'
);

$match_results=get_similar_documents($query,$corpus);
echo '<pre>';
    print_r($match_results);
echo '</pre>';
RESULTS
Array
(
    [3] => 0.6813781191217
    [1] => 0.54248125036058
)
There are plenty of improvements to be made, but the model provides a way of getting good results from natural queries that don't use boolean operators, which boolean checks such as strpos(), preg_match(), strstr() or stristr() cannot offer.
NOTA BENE
Optionally, redundancy can be eliminated before the words are searched, which brings:
  • a smaller index and lower storage requirements
  • less disk I/O
  • faster indexing and, consequently, a faster search.
1. Normalisation
  • Convert all text to lower case
2. Stopword elimination
  • Eliminate words from the text which carry no real meaning (like 'and', 'or', 'the', 'for', etc.)
3. Dictionary substitution
  • Replace words with others which have an identical or similar meaning (e.g. replace instances of 'hungrily' and 'hungry' with 'hunger').
  • Further algorithmic measures (snowball) may be performed to further reduce words to their essential meaning.
  • Replacing colour names with their hexadecimal equivalents and reducing the precision of numeric values are other ways of normalising the text.
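The first two steps can be sketched as a small pre-processing function. Everything here is illustrative: normalise_terms is a hypothetical helper that does not appear in the code above, the stopword list is a tiny sample, and the split pattern assumes plain ASCII text.

```php
<?php
// Hypothetical helper: lower-cases, strips punctuation and drops stopwords.
// The default stopword list is a tiny illustrative sample, not a recommended set.
function normalise_terms(string $text, array $stopwords = ['and', 'or', 'the', 'for']): array
{
    // 1. Normalisation: convert all text to lower case.
    $text = strtolower($text);

    // Split on anything that is not an ASCII letter or digit (assumes ASCII input).
    $terms = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // 2. Stopword elimination; array_values() re-indexes the remaining terms.
    return array_values(array_diff($terms, $stopwords));
}

print_r(normalise_terms('The cat and the hat'));
// Array
// (
//     [0] => cat
//     [1] => hat
// )
```

Joining the returned terms with spaces before indexing and querying would make 'are!', 'Are' and 'are' count as the same term.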



I'm a bit impressed that none of the answers here that used strpos, strstr and similar functions mentioned Multibyte String Functions yet (2015-05-08).
Basically, if you're having trouble finding words with characters specific to some languages, such as German, French, Portuguese, Spanish, etc. (e.g.: äéôçºñ), you may want to precede the functions with mb_. Therefore, the accepted answer would use mb_strpos or mb_stripos (for case-insensitive matching) instead:
if (mb_strpos($a,'are') !== false) {
    echo 'true';
}
If you cannot guarantee that all your data is 100% in UTF-8, you may want to use the mb_ functions.
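A minimal sketch of the case-insensitive variant with a non-ASCII letter. It assumes the mbstring extension is loaded and that both the source file and the strings are UTF-8; the sample sentence is made up.

```php
<?php
// Assumes the mbstring extension is available and all strings are UTF-8.
$a = 'Wie geht es Ihnen, Herr MÜLLER?';

// mb_stripos() folds case for non-ASCII letters too, so 'ü' matches 'Ü'.
// The fourth argument names the encoding explicitly instead of relying on the default.
if (mb_stripos($a, 'müller', 0, 'UTF-8') !== false) {
    echo 'true';
}
```

The offset that mb_stripos() returns is counted in characters rather than bytes, which matters as soon as multibyte characters precede the match.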



if (preg_match('/are/', $a)) {
   echo 'true';
}



You can use the strstr function:
$haystack = "I know programming";
$needle   = "know";
$flag = strstr($haystack, $needle);

if ($flag !== false) {

    echo "true";
}
Without using an inbuilt function:
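$haystack = "hello world";
$needle   = "llo";

$found = false;

// Try every starting offset in the haystack.
for ($start = 0; isset($haystack[$start]) && !$found; $start++) {
    $i = 0;
    // Advance while the needle and haystack characters agree.
    while (isset($needle[$i]) && isset($haystack[$start + $i])
        && $needle[$i] === $haystack[$start + $i]) {
        $i++;
    }
    // The whole needle was consumed, so it matches at $start.
    if (!isset($needle[$i])) {
        $found = true;
    }
}

echo $found ? "YES" : "NO";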



The short-hand version
$result = false!==strpos($a, 'are');



In order to find a 'word', rather than the occurrence of a series of letters that could in fact be a part of another word, the following would be a good solution.
$string = 'How are you?';
$array = explode(" ", $string);

if (in_array('are', $array) ) {
    echo 'Found the word';
}
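Splitting on a single space leaves punctuation attached to the tokens, so in the example above 'you' would not be found because the token is 'you?'. A sketch that splits on any run of non-letter, non-digit characters instead; hasWord is a hypothetical name, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: tokenises on runs of characters that are neither letters nor digits.
function hasWord(string $string, string $word): bool
{
    // \p{L} = any letter, \p{N} = any number; the u flag treats the input as UTF-8.
    // preg_split() returns false on invalid UTF-8, hence the fallback to an empty array.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $string, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    return in_array($word, $words, true);
}

var_dump(in_array('you', explode(' ', 'How are you?'))); // bool(false)
var_dump(hasWord('How are you?', 'you'));                // bool(true)
var_dump(hasWord('What is the area?', 'are'));           // bool(false)
```

The comparison is case-sensitive; lower-casing both the string and the word first would make it case-insensitive.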



It can be done in three different ways:
 $a = 'How are you?';
1- stristr()
 if (strlen(stristr($a,"are"))>0) {
    echo "true"; // are Found
 } 
2- strpos()
 if (strpos($a, "are") !== false) {
   echo "true"; // are Found
 }
3- preg_match()
 if (preg_match("/are/", $a) === 1) {
   echo "true"; // are Found
 }



$a = 'how are you';
if (strpos($a,'are') !== false) {
    echo 'true';
}



You need to use the identical/not identical operators because strpos can return 0 as its index value. If you like ternary operators, consider using the following (seems a little backwards, I'll admit):
echo FALSE === strpos($a,'are') ? 'false': 'true';



The strpos function works fine, but if you want to do case-insensitive checking for a word in a paragraph then you can make use of the stripos function of PHP.
For example,
$result = stripos("I love PHP, I love PHP too!", "php");
if ($result === false) {
    // Word does not exist
}
else {
    // Word exists
}
stripos() finds the position of the first occurrence of a case-insensitive substring in a string.
If the word doesn't exist in the string, it returns false; otherwise it returns the position of the word.
