
Technology

There will soon be a “substantially better” and bigger addition to one of the biggest AI training databases in the world


Huge corpora of AI training data have been called “the backbone of large language models.” But in 2023, amid a growing outcry over the ethical and legal implications of the datasets behind the best-known LLMs, such as OpenAI’s GPT-4 and Meta’s Llama, EleutherAI—the nonprofit that produced one of the largest of these datasets, the Pile, an 825 GB open-source corpus of diverse text—became a target.

One of the numerous generative AI cases filed last year involved EleutherAI, a grassroots nonprofit research group that began in 2020 as a loosely organized Discord collective aiming to understand how OpenAI’s then-new GPT-3 worked. In October, former Arkansas governor Mike Huckabee and other authors filed a lawsuit claiming that their books were taken without permission and added to Books3, a contentious dataset that was part of the Pile project and featured more than 180,000 titles.

EleutherAI, however, is now updating the Pile in partnership with other institutions, including the University of Toronto and the Allen Institute for AI, as well as independent researchers, and the work is far from finished. The new Pile dataset won’t be finalized for a few months, according to Stella Biderman, executive director of EleutherAI and lead scientist and mathematician at Booz Allen Hamilton, and Aviya Skowron, head of policy and ethics at EleutherAI, in a joint interview with VentureBeat.

It is anticipated that the new Pile will be “substantially better” and larger

According to Biderman, the upcoming LLM training dataset is anticipated to surpass the previous one in size and quality, making it “substantially better.”

“There’s going to be a lot of new data,” said Biderman. Some, she said, will be data that has not been seen anywhere before and “that we’re working on kind of excavating, which is going to be really exciting.”

Compared with the original dataset, which was released in December 2020 and used to train language models such as the Pythia suite and Stability AI’s Stable LM suite, the Pile v2 will contain more recent data. It will also have improved preprocessing. “We had never trained an LLM before we made the Pile,” Biderman explained. “We’ve now trained nearly a dozen, and we know a lot more about how to clean data so that LLMs can use it.”

Better-quality and more varied data will also be included in the new dataset. She stated, “For example, we’re going to have a lot more books than the original Pile had and a more diverse representation of non-academic non-fiction domains.”

In addition to Books3, the original Pile comprises 22 sub-datasets, including Wikipedia, YouTube subtitles, PubMed Central, arXiv, Stack Exchange and, oddly enough, Enron emails. Biderman noted that the Pile remains the LLM training dataset with the most thorough documentation from its creators. The goal in creating it was to build a massive new dataset, containing billions of text passages, comparable in size to the one OpenAI used to train GPT-3.

When the Pile was first made public, it was a unique AI training dataset

“Back in 2020, the Pile was a very important thing, because there wasn’t anything quite like it,” Biderman remarked. At the time, she explained, Google was using one large publicly accessible text corpus, C4, to train a range of language models.

“But C4 is not nearly as big as the Pile is and it’s also a lot less diverse,” she said. “It’s a really high-quality Common Crawl scrape.”

EleutherAI, by contrast, aimed to be more discriminating, defining the kinds of data and topics it wanted the model to be knowledgeable about.

“That was not really something anyone had ever done before,” she explained. “75%-plus of the Pile was chosen from specific topics or domains, where we wanted the model to know things about it — let’s give it as much meaningful information as we can about the world, about things we care about.”

EleutherAI’s “general position is that model training is fair use” for copyrighted material, according to Skowron. But they noted that there is currently no large language model on the market that is not trained on copyrighted data, and that one of the Pile v2 project’s goals is to try to address some of the problems around copyright and data licensing.

To reflect that work, they detailed what the new Pile dataset will include: text in the public domain that was never within the scope of copyright in the first place, such as government documents and legal filings (Supreme Court opinions, for example); text licensed under Creative Commons; code under open-source licenses; text whose license explicitly permits redistribution and reuse (some open-access scientific articles fall into this category); and smaller datasets for which researchers have the express permission of the rights holders.
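The selection criteria above amount to a license-based filter over candidate documents. A minimal sketch of that idea in Python, assuming a hypothetical record format with a `license` metadata field (the labels and schema below are illustrative, not EleutherAI’s actual pipeline):

```python
# Hypothetical sketch of the inclusion criteria described above: keep only
# text whose license (or legal status) permits redistribution and reuse.
# The license labels and record format are illustrative assumptions.

PERMITTED = {
    "public-domain",        # e.g. government documents, court opinions
    "cc-by", "cc-by-sa",    # Creative Commons licenses that allow reuse
    "mit", "apache-2.0",    # open-source code licenses
    "explicit-permission",  # rights holder granted permission directly
}

def filter_corpus(records):
    """Return only the records whose license metadata permits inclusion."""
    return [r for r in records if r.get("license") in PERMITTED]

corpus = [
    {"text": "Supreme Court opinion ...", "license": "public-domain"},
    {"text": "A copyrighted novel ...", "license": "all-rights-reserved"},
    {"text": "def main(): ...", "license": "mit"},
]

kept = filter_corpus(corpus)
print(len(kept))  # 2: the public-domain opinion and the MIT-licensed code
```

In practice, license metadata is rarely this clean; much of the effort in a project like this lies in establishing provenance well enough to assign such labels in the first place.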

Following ChatGPT, criticism of AI training datasets gained traction

The influence of AI training datasets has long been a source of concern. A 2018 study co-authored by AI researchers Joy Buolamwini and Timnit Gebru, for instance, found that large image collections contributed to racial bias in AI systems. Legal disputes over large image training datasets began to develop around mid-2022, not long after the public realized that popular text-to-image generators like Midjourney and Stable Diffusion were trained on enormous image datasets scraped mostly from the internet.

But criticism of the datasets used to train LLMs and image generators increased significantly after OpenAI’s ChatGPT was released in November 2022, particularly with regard to copyright. Following a wave of generative AI lawsuits from authors, publishers, and artists, the New York Times sued Microsoft and OpenAI last month. Many observers believe the case will eventually reach the Supreme Court.

Lately, however, there have been graver, more unsettling allegations: the LAION-5B image dataset was removed last month after thousands of images of child sexual abuse were discovered in it, and the large image corpora that trained text-to-image models have made deepfake revenge porn easy to produce.

The discussion surrounding AI training data is quite intricate and subtle

According to Biderman and Skowron, the discussion around AI training data is significantly more intricate and multifaceted than media coverage and AI critics portray it to be.

For example, Biderman said it is difficult to properly remove the offending images because the detection approach used by the people who flagged the LAION content is not legally available to the LAION organization. There may also not be enough resources to prescreen datasets for this type of imagery.

“There seems to be a very big disconnect between the way organizations try to fight this content and what would make their resources useful to people who wanted to screen data sets,” she said.

On other concerns, such as the impact on creative workers whose output was used to train AI models, “a lot of them are upset and hurt,” said Biderman. “I totally understand where they’re coming from.” But she pointed out that some creatives uploaded work to the internet under permissive licenses without knowing that, years later, AI training datasets (including Common Crawl) could use the work under those licenses.

“I think a lot of people in the 2010s, if they had a magic eight ball, would have made different licensing decisions,” she said.

But EleutherAI did not have a magic eight ball either. Biderman and Skowron agree that when the Pile was created, AI training datasets were used mostly for research, where licensing and copyright exemptions are fairly broad.

AI technologies have very recently made a jump from being primarily research products and scientific artifacts to being primarily commercial products, Biderman said. Google had put some of these models into commercial use on the back end in the past, she explained, but training on very large, mostly web-scraped datasets “became a question very recently.”

To be fair, Skowron noted, legal scholars such as Ben Sobel had been thinking about AI and the legal question of “fair use” for many years. But even people at OpenAI, “who you’d think would be in the know about the product pipeline,” did not grasp the public, commercial implications of ChatGPT that were coming down the pike, they added.

Open datasets are safer to use, according to EleutherAI

While it may seem counterintuitive to some, Biderman and Skowron also argue that AI models trained on open datasets like the Pile are safer to use, because visibility into the data is what allows the resulting models to be used safely and ethically in a range of scenarios.

“There needs to be much more visibility in order to achieve many policy objectives or ethical ideals that people want,” said Skowron, including, at the very minimum, thorough documentation of the training. “And for many research questions you need actual access to the data sets, including those that are very much of interest to copyright holders, such as memorization.”

For the time being, Biderman, Skowron, and EleutherAI colleagues keep working on the Pile’s update.

“It’s been a work in progress for about a year and a half and it’s been a meaningful work in progress for about two months — I am optimistic that we will train and release models this year,” said Biderman. “I’m curious to see how big a difference this makes. If I had to guess…it will make a small but meaningful one.”


Threads uses a more sophisticated search to compete with Bluesky

Instagram Threads, Meta’s rival to Elon Musk’s X, is getting an enhanced search experience, the company said Monday. The app, which is built on Instagram’s social graph, is introducing a new feature that lets users search for specific posts by date range and user profile.

This is less thorough than X’s advanced search, which lets users refine queries by language, keywords, exact phrases, excluded terms, hashtags, and more. But it does make it simpler for Threads users to find particular posts. It also brings Threads’ search closer to Bluesky’s, which lets users write advanced queries that restrict searches by user profile, date range, and other criteria, though not all of those filtering options are yet exposed in the Bluesky app’s user interface.
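For illustration, Bluesky-style search accepts inline operators of this general shape (the handle is hypothetical, and the exact operator syntax should be treated as an assumption rather than a guarantee):

```text
from:alice.bsky.social since:2024-11-01 until:2024-12-01 "starter packs"
```

A query like this restricts results to posts from one account, within a date range, containing the quoted phrase.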

Meta has been launching new features in quick succession in recent days to counter the threat posed by social networking startup Bluesky, which has quickly gained traction as another X competitor. Bluesky had more than 9 million users in September, then grew rapidly in the weeks after the U.S. elections as users left X over Elon Musk’s political views and other policy changes, including plans to alter the way blocks operate and to let AI companies train on X user data. According to Bluesky, it now has around 24 million users.

Threads’ recent additions include an improved algorithm, a design change that makes switching between feeds easier, and the option for users to choose their own default feed. Threads has also been spotted working on its own version of Starter Packs, Bluesky’s user-curated recommendation lists.



Apple’s own 5G modem-equipped iPhone SE 4 is “confirmed” to launch in March

Tom O’Malley, an analyst at Barclays, recently visited Asia with colleagues to meet electronics suppliers and manufacturers. In a research note released this week outlining the trip’s main conclusions, the analysts said they had “confirmed” that a fourth-generation iPhone SE with an Apple-designed 5G modem is scheduled to launch near the end of the first quarter of next year. That timeline implies the next iPhone SE will be unveiled in March, similar to when the current model was unveiled in 2022, in keeping with earlier rumors.

The rumored features of the fourth-generation iPhone SE include a 6.1-inch OLED display, Face ID, a newer A-series chip, a USB-C port, a single 48-megapixel rear camera, 8GB of RAM to enable Apple Intelligence support, and the previously mentioned Apple-designed 5G modem. The SE is anticipated to have a similar design to the base iPhone 14.

Apple has reportedly been developing its own 5G modem for iPhones since 2018, a move that will let it reduce and eventually eliminate its reliance on Qualcomm. With Qualcomm’s 5G modem supply agreement for iPhone launches extended through 2026 earlier this year, Apple still has plenty of time to complete the switch to its own modem. Beyond the fourth-generation iPhone SE, Apple analyst Ming-Chi Kuo has said the so-called “iPhone 17 Air” will also come with an Apple-designed 5G modem.

Whether Apple’s first 5G modem will offer consumers any advantages over Qualcomm’s modems, such as faster speeds, is uncertain.

Apple sued Qualcomm in 2017 over anticompetitive behavior and $1 billion in unpaid royalties. After the two firms settled the dispute in 2019, Apple purchased the majority of Intel’s smartphone modem business, acquiring a portfolio of cellular technology patents to support its modem development. It appears we will finally see the results of that effort in four more months.

Apple announced the third-generation iPhone SE online on March 8, 2022. With dated features like a Touch ID button, a Lightning port, and large bezels around the screen, the handset resembles the iPhone 8. The iPhone SE currently retails for $429 in the United States, though the new model may see at least a modest price increase.



Google is said to be discontinuing the Pixel Tablet 2 and may be leaving the market once more

Google terminated development of the Pixel Tablet 3 yesterday, according to Android Headlines, before a second-generation model was even announced. In fact, the report says the second-generation Pixel Tablet has been canceled as well. That means the device released last year is likely to be a one-off, and Google is abandoning the tablet market for the second time in just over five years.

If accurate, the report indicates that Google has determined that it is not worth investing more money in a follow-up because of the dismal sales of the Pixel Tablet. Rumors of a keyboard accessory and more functionality for the now-defunct project surfaced as recently as last week.

It is worth keeping in mind that Google’s Nest division may pursue large-screen devices such as the Nest Hub and Hub Max rather than standalone tablets.

Google has always struggled to make a significant impact in the tablet market or to build a competitor that can match Apple’s iPad in sales and overall performance, not helped in the least by its inconsistent approach. After a promising start with the Nexus 7 years ago, it never really followed through, even when the hardware was good. Android’s significant lag behind iPadOS in the number of tablet-optimized third-party apps has also hampered Google’s efforts.

Google first declared it was done making tablets in 2019, after the Pixel Slate received overwhelmingly unfavorable reviews. Two tablets still in development at the time were scrapped.

By 2022, however, Google had changed its mind and announced that its Pixel hardware team was developing a tablet. That device eventually shipped as the $499 Pixel Tablet, which came with a speaker dock the tablet could magnetically attach to. (Google would later sell the tablet alone for $399.)

